24 posts in the 'progress report' category

Weekly Report (11/9)

Lab.work 2009. 11. 10. 12:07
@ Phoenix

1) Ideal vs. Mesh
 
The CPI gap between the ideal and mesh networks is not very large.
I'm afraid that even if we optimize MapReduce for the mesh network,
it may not contribute much to the overall performance.

2) miss rate of each phase
 
I used a 64KB 4-way L1 cache and a 1MB 4-way shared L2 cache for each node.
Garnet does not report the actual hit/miss rate, so we should compare misses per 1,000 instructions instead.
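
Just to be explicit about the normalization: misses per 1,000 instructions (MPKI) is simply the miss count scaled by the instruction count. A minimal standalone sketch, not Garnet/Ruby code, with made-up numbers:

    #include <cstdint>
    #include <iostream>

    // Misses per 1,000 instructions (MPKI): the normalization used instead of
    // a raw miss rate, since the stats report miss counts rather than hit ratios.
    double mpki(uint64_t misses, uint64_t instructions) {
        return 1000.0 * static_cast<double>(misses) /
               static_cast<double>(instructions);
    }

    int main() {
        // Hypothetical numbers, only to show the calculation.
        std::cout << "example L2 MPKI: " << mpki(120000, 45000000) << "\n";
        return 0;
    }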

3) Injection rate of each node over time
The result files are attached.
[msg.png] shows the injection rate of coherence messages over the processing time.
[data.png] shows the injection rate of coherence data.

We can see the diagonal lines as expected,
but at the same time some nodes are more heavily used than others.

@ Booksim

1) Injection VOQ
There was a small problem with the last result.
I fixed the program, and the new result is attached.
[3000cycle.xlsx]

VOQ is 3~4 times better than no-VOQ when the injection rate is very high.

2) Tests on 64 nodes
I also tested an 8x8 mesh, a 64-node ring, and a 4-ary 3-fly.
For the fly topology, VOQ is 8~9 times better than no-VOQ.
For the mesh and the ring (torus), the simulation cannot reach high injection rates,
so we cannot see the difference between VOQ and no-VOQ.

3) "A New Scalable and Cost-Effective Congestion Management Strategy for Lossless Multistage Interconnection Networks" (HPCA 2005)
I am now trying to re-implement the scheme described in this paper.

Thanks.
minjeong


Weekly Report

Lab.work 2009. 11. 4. 19:05
@ Phoenix

The network traffic patterns I showed you last week were wrong because my code had the Garnet error.
I ran it again; the number of messages is much larger than before, but the pattern is still similar.

I am also counting the packets every 1,000,000 cycles to get the average injection rate per cycle.
It may take two or more days to finish.
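
The counting is roughly the sketch below: one counter per node per 1,000,000-cycle window, divided by the window length at the end. This is a standalone illustration, not the actual simulator hook; the class and its interface are my own.

    #include <cstdint>
    #include <vector>

    // Count packets injected by each node in fixed-size cycle windows, then
    // divide by the window length to get the average injection rate per cycle.
    class WindowedInjectionCounter {
    public:
        WindowedInjectionCounter(int num_nodes, uint64_t window_cycles)
            : window_(window_cycles), counts_(num_nodes) {}

        // Call once for every packet injected by src_node at the given cycle.
        void record(int src_node, uint64_t cycle) {
            uint64_t bin = cycle / window_;
            std::vector<uint64_t>& bins = counts_[src_node];
            if (bins.size() <= bin) bins.resize(bin + 1, 0);
            ++bins[bin];
        }

        // Average packets per cycle for one node in one window.
        double rate(int src_node, uint64_t bin) const {
            const std::vector<uint64_t>& bins = counts_[src_node];
            if (bin >= bins.size()) return 0.0;
            return static_cast<double>(bins[bin]) / static_cast<double>(window_);
        }

    private:
        uint64_t window_;                            // e.g. 1,000,000 cycles
        std::vector<std::vector<uint64_t>> counts_;  // [node][window] -> packets
    };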

@ Booksim

I implemented the injection queue using VOQ,
and tested with bufsize = 16 and 16 VCs; packets were counted only between cycles 3000 and 6000.
A packet is randomly selected from among the non-empty injection queues.
The result file is attached.
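
The injection logic is essentially the following: one queue per destination at the source, and each injection slot a random non-empty queue is served, so one blocked destination cannot stall packets headed elsewhere. This is a simplified standalone sketch, not the actual BookSim change; the Packet struct and the use of rand() are assumptions.

    #include <cstdlib>
    #include <deque>
    #include <vector>

    struct Packet {
        int dest;
        // ... payload, timestamps ...
    };

    // Virtual output queues at the source: one injection queue per destination.
    class VoqInjector {
    public:
        explicit VoqInjector(int num_dests) : queues_(num_dests) {}

        void enqueue(const Packet& p) { queues_[p.dest].push_back(p); }

        // Returns true and fills 'out' if some queue had a packet to inject.
        bool select(Packet& out) {
            std::vector<int> ready;
            for (int d = 0; d < static_cast<int>(queues_.size()); ++d)
                if (!queues_[d].empty()) ready.push_back(d);
            if (ready.empty()) return false;
            int d = ready[std::rand() % ready.size()];
            out = queues_[d].front();
            queues_[d].pop_front();
            return true;
        }

    private:
        std::vector<std::deque<Packet>> queues_;
    };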


Weekly Report (10/27)

Lab.work 2009. 10. 28. 18:30
@ Phoenix

Last Thursday, I gave the Phoenix presentation and we discussed it.

I also printed out the network traffic of Phoenix (wordcount) on a 16-node CMP mesh topology using the Garnet network.
In the attached graph (network_traffic.pdf), the horizontal axis represents the source nodes and the vertical axis the destinations.
The result is the total traffic from the beginning to the end of the program, and it does not show any specific pattern.

I think I should use magic breakpoints and find the traffic patterns of the map/reduce/merge phases.

@ Booksim

As I mentioned earlier, the VOQ result finally improves as we expected.

I only counted the accepted packets between cycles 3000 and 6000 and then calculated the throughput (warm-up takes 3000 cycles).
I also used an effectively infinite buffer (size 1,000,000,000),
because if the buffer size is small, the first packet in the injection queue gets blocked and VOQ cannot process the other packets.
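
For reference, the throughput number is computed roughly as below: accepted packets per destination divided by the length of the measurement window. This is a standalone sketch, not the BookSim stats code; the interface is invented.

    #include <cstdint>
    #include <vector>

    // Accepted throughput over a fixed measurement window: packets delivered
    // per node per cycle, counting only arrivals inside [start, end).
    class AcceptedThroughput {
    public:
        AcceptedThroughput(int num_nodes, uint64_t start, uint64_t end)
            : start_(start), end_(end), accepted_(num_nodes, 0) {}

        // Call when a packet is delivered to its destination node.
        void on_packet_accepted(int dest_node, uint64_t cycle) {
            if (cycle >= start_ && cycle < end_) ++accepted_[dest_node];
        }

        // Per-node throughput in packets/cycle over the window.
        double throughput(int dest_node) const {
            return static_cast<double>(accepted_[dest_node]) /
                   static_cast<double>(end_ - start_);
        }

    private:
        uint64_t start_, end_;            // e.g. cycles 3000 and 6000
        std::vector<uint64_t> accepted_;  // packets accepted per node
    };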

(3000cycle_infinite_buf.xlsx)
I tested further with UR traffic and UR + 50% hotspot traffic.
When a larger portion of the packets goes to the hotspot node, VOQ performs much better than no-VOQ, of course.
But when I use UR traffic without a hotspot, VOQ is not always better.
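
To be precise about the traffic: with probability equal to the hotspot fraction the destination is the hotspot node, otherwise it is uniform random. A standalone sketch, not BookSim's built-in traffic function; the RNG choice and parameter names are assumptions.

    #include <random>

    // UR + hotspot traffic: with probability hotspot_fraction the packet goes
    // to the hotspot node, otherwise the destination is uniform random over
    // all nodes. hotspot_fraction = 0.5 is the "UR + 50% hotspot" case above.
    int pick_destination(int num_nodes, int hotspot_node, double hotspot_fraction,
                         std::mt19937& rng) {
        std::uniform_real_distribution<double> coin(0.0, 1.0);
        if (coin(rng) < hotspot_fraction) return hotspot_node;
        std::uniform_int_distribution<int> uniform_dest(0, num_nodes - 1);
        return uniform_dest(rng);
    }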

Thanks.
minjeong


Weekly Report (10/20)

Lab.work 2009. 10. 21. 15:35
@ Phoenix 

I attached the PPT file for the presentation.
A summary and the time taken by each simulation have been added.

@ Booksim 

With the fly topology, the overall accepted throughput of VOQ is not better than that of no-VOQ (similar to the mesh and ring).
But before that, I am puzzled that the overall average latency of VOQ (vc=16, buf=16) is larger than that of no-VOQ (vc=1, buf=16), especially for the fly topology. This happens when the packet size is 4 or when using 25% hotspot traffic.
I tested the same thing on the 64-node fly, and the result is the same regardless of buffer size.
It may or may not be a simulation error; I need to figure that out.

Thanks. 
minjeong


Weekly Report (10/12)

Lab.work 2009. 10. 12. 23:01
@ Booksim

I ran the simulator with age priority, but the problem was not the old packets.
[voq_age_hotspot.pdf] (already sent)

I also expanded the mesh to 8x8 and ran the simulator.
I found that if a node is close to the hotspot, the average latency of VOQ is lower than that of no-VOQ,
but if a node is far from the hotspot, the average latency of VOQ becomes larger.
[mesh88_voq.xlsx] [mesh88_voq_accepted_packet.xlsx] (already sent)

About the tests on the ring topology:
to prevent routing deadlock, I used twice as many VCs as the number of nodes.
For both the 16-node and 64-node rings, the total number of accepted packets with VOQ is larger than with no-VOQ,
but if I normalize the data to packets/cycle, the VOQ result is not better anymore.
[ring16_voq.xlsx] [ring64_voq.xlsx] [ring16_voq_accepted_packet.xlsx] (already sent)
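
The deadlock avoidance is the usual dateline idea layered on top of the per-destination classes; that is my reading of "twice as many VCs as nodes" (one class per destination, doubled across the dateline). A rough standalone sketch, not the actual routing function:

    // VC selection on a clockwise ring with per-destination VC classes doubled
    // for deadlock avoidance: one class per destination, and a packet moves to
    // the upper copy of its class once it has crossed the dateline link
    // (num_nodes - 1) -> 0, which breaks the cyclic channel dependency.
    int select_vc(int src, int current, int dest, int num_nodes) {
        bool needs_wrap = dest < src;                // clockwise path wraps around
        bool crossed = needs_wrap && current < src;  // already past the dateline
        int base_vc = dest;                          // per-destination (VOQ) class
        return crossed ? base_vc + num_nodes : base_vc;
    }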

To find the reason, I roughly drew the packet flow and compared VOQ and no-VOQ,
where both have 16 VCs and 16 buffers, but 25% of the packets are blocked with no-VOQ.
I am not sure my drawing is right, but from the diagram I can see that:
1. the number of cycles taken is the same;
2. the VOQ buffers saturate faster than the no-VOQ buffers.
I should also test the 8x8 mesh,
and if this turns out to be the reason, I think I need to look at the buffer status again.
[packet_flow.pptx]

@ CS710 project

We counted the number of messages for each coherence message type and plotted a 3D graph.
Now we are trying to find the pattern of each type and which topology suits each message type.
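
The counting boils down to one (src x dest) matrix per message type, and each matrix is what gets plotted as one 3D surface. A minimal standalone sketch, not the actual Ruby/GEMS hook; the names are invented.

    #include <cstdint>
    #include <map>
    #include <string>
    #include <vector>

    // Tally coherence messages into one (src x dest) matrix per message type;
    // each matrix corresponds to one 3D surface (src x dest x count).
    class CoherenceMessageCounter {
    public:
        explicit CoherenceMessageCounter(int num_nodes) : num_nodes_(num_nodes) {}

        void record(const std::string& msg_type, int src, int dest) {
            std::vector<std::vector<uint64_t>>& m = counts_[msg_type];
            if (m.empty())
                m.assign(num_nodes_, std::vector<uint64_t>(num_nodes_, 0));
            ++m[src][dest];
        }

        const std::vector<std::vector<uint64_t>>& matrix(const std::string& msg_type) {
            return counts_[msg_type];
        }

    private:
        int num_nodes_;
        std::map<std::string, std::vector<std::vector<uint64_t>>> counts_;
    };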

Thanks.


Weekly report (10/5)

Lab.work 2009. 10. 12. 22:33
For the report:

@ Booksim

25% hotspot (rest uniform random) traffic saturates at an injection rate of 0.05~0.06,
and 50% hotspot traffic saturates at an injection rate of 0.03.

For the average latency of packets from src = 0, 4, 8, 12 to dest != 0:
VOQ works better than no-VOQ at the saturation point.
The results at other injection rates seem meaningless. (voq_hotspot.xlsx)
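
Those per-source numbers come from a simple filter over delivered packets, roughly as below. This is a standalone sketch, not the BookSim stats code; the callback shape is an assumption.

    #include <cstdint>
    #include <set>

    // Accumulate average latency only for packets injected at the selected
    // sources (src = 0, 4, 8, 12) and not destined to the hotspot node 0.
    struct FilteredLatency {
        std::set<int> sources{0, 4, 8, 12};
        uint64_t total_latency = 0;
        uint64_t packets = 0;

        void on_packet_delivered(int src, int dest, uint64_t inject_cycle,
                                 uint64_t deliver_cycle) {
            if (sources.count(src) == 0 || dest == 0) return;
            total_latency += deliver_cycle - inject_cycle;
            ++packets;
        }

        double average() const {
            return packets ? static_cast<double>(total_latency) / packets : 0.0;
        }
    };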

@ Phoenix
I added a page that explains the differences between Phoenix ver1.0 and ver2.0,
and tested the actual processing time of the wordcount application with Ruby.
Wordcount with 16 cores takes 2~3 hours to finish with Ruby (p2p default).



Weekly report (9/28)

Lab.work 2009. 9. 29. 00:36
Hi. 

First of all, I apologize for the late reports every week :'(
I will be more punctual next time.

@ mapreduce ppt 

The PPT is almost done except for the summary part.
I thought about what to write, but I think I need to discuss the new results first.

I spent time explaining the MapReduce concept first,
and compared the results in the paper with my test results (ver1.0 and ver2.0 on the SunT2 machine).
Then I added the dynamic load balancing and execution time breakdown,
and divided the test configurations into four parts: input size, thread distribution, cache size, and page size.

Actually, I tested the page size this week and found that
when I changed the page size to 64K (the default is 8K), there was no degradation for wordcount.

So I am now testing whether changing the page size
makes the degradation disappear in the Simics results (where there was degradation beyond 16 nodes).

@ Booksim 

I printed the buffer occupancy for each cycle and found that the VOQ buffers saturate much faster than the no-VOQ buffers.
I think this is because the total number of packets with VOQ is larger than with no-VOQ.

So I printed the latency of each result only at cycle = 5000 and cycle = 2000;
then VOQ finally shows a better result than no-VOQ.
(I checked that VOQ is better than no-VOQ for the same number of cycles,
but the final cycle count of VOQ is about twice that of no-VOQ because of the number of packets.)

I will test with varying buffer sizes and compare the results.

Thanks. 
minjeong


Progress Report (9/15)

Lab.work 2009. 9. 15. 15:24
■ THIS WEEK  (9/8~9/14)

□ CS 710 project

I compared the CPI of IDEAL, MESH, and PT_TO_PT,
where PT_TO_PT is the topology that Garnet offers by default (it is almost ideal),
MESH is the 4x4 mesh topology,
and IDEAL uses the same link latency and configuration as MESH but connects the nodes point-to-point.

For the three applications barnes, cont-ocean, and noncont-ocean,
the CPI ordering is PT_TO_PT < IDEAL < MESH,
but the gaps between them are not large.
[mesh_vs_ideal.xlsx]

TODO: I think the network topology does not seriously affect performance for the SPLASH-2 applications.
We might need to test SPECjbb or other workloads and compare the results.

□ Booksim

I plotted a 3-D graph of accepted packets for each node, but could not find a specific pattern.
[booksim_hotspot.xlsx]

TODO: I want to fix the TrafficManager code to inject the same number of packets for VOQ and no-VOQ,
and then compare the results.

□ Phoenix

I changed the wordcount_map function to count all words
(before, it only counted words consisting of a-z).
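
The change is only in the character test used to split words. A simplified standalone sketch of my reading of the filter, not the actual Phoenix wordcount_map code:

    #include <cctype>

    // Old filter: a word is a maximal run of alphabetic characters, so tokens
    // containing digits or punctuation were skipped.
    static bool old_word_char(char c) {
        return std::isalpha(static_cast<unsigned char>(c)) != 0;
    }

    // New filter: any run of non-whitespace characters counts as a word, so
    // every token in the input is emitted and counted.
    static bool new_word_char(char c) {
        return std::isspace(static_cast<unsigned char>(c)) == 0;
    }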

The speedup pattern of the map phase does not change,
but the reduce time increases hugely.
From this result, the reduce processing time depends on the number of words.

Actually,
in the MapReduce version of wordcount, the map threads store all words in the intermediate queue,
and after that the reduce threads read the words from the intermediate queue.
In the pthreads version, there is no intermediate queue between map and reduce,
so every word passes through map and reduce at once.
[test_on_gems.xlsx]
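
To make the structural difference concrete, here is a minimal standalone sketch of the two shapes (illustration only, not Phoenix's actual API; all names are invented):

    #include <string>
    #include <unordered_map>
    #include <vector>

    // MapReduce-style shape: map buffers every word in an intermediate
    // container, and reduce runs only after map has finished. With all words
    // counted, the intermediate data grows with the input, which is where the
    // extra reduce time shows up.
    void mapreduce_wordcount(const std::vector<std::string>& words,
                             std::unordered_map<std::string, long>& counts) {
        std::vector<std::string> intermediate;                   // filled by "map"
        for (const std::string& w : words) intermediate.push_back(w);
        for (const std::string& w : intermediate) ++counts[w];   // "reduce"
    }

    // pthreads-style shape: each word is counted as soon as it is produced,
    // so there is no intermediate queue between the two steps.
    void pthreads_wordcount(const std::vector<std::string>& words,
                            std::unordered_map<std::string, long>& counts) {
        for (const std::string& w : words) ++counts[w];          // map + reduce fused
    }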

TODO: I think I need to check the size of the intermediate queue
and the latency of storing and reading the words.


Progress Report (9/8)

Lab.work 2009. 9. 8. 14:57
■ THIS WEEK  (9/1~9/7)

□ Booksim

I tested the same process with randperm traffic, but the result is almost the same.
Now I am trying to print out the hotspot latency and the other latency separately and compare them.

□ Phoenix

* Degradation of map time
I thought that the reason for the degradation was the disk I/O time.
For a shared-memory system, the cache miss rate increases as the number of processors increases.

* Test with the SunT2 configuration
Using an 8KB private L1 data cache for each processor and a 4MB shared L2 cache,
the speedup degradation appears beyond 16 cores,
and the map time usually dominates the whole processing time.

* Test with a (nearly) perfect cache
Using a 100MB private L1 data cache and a 400MB shared L2 cache (other configurations are the same as SunT2),
the L1 read hit rate of each processor is almost 99% (it is not 100% because of cold misses),
but the write hit rate decreases slightly as the number of processors increases.
The overall performance of the processors improves, but the speedup pattern is the same:
there is still degradation beyond 16 cores, and the map time is the dominant part.

I think I need to test other applications and compare the results,
and I will try to focus on the working set size of each application.


PROGRESS REPORT (9/1)

Lab.work 2009. 9. 1. 09:38
■ THIS WEEK  (8/27~9/1)

□ Booksim

* 50% of packets to dest 0, 50% uniform random
VOQ throughput never exceeds the no-VOQ result.
I counted the number of times a VC was full and no flit could be sent to it;
with VOQ, many flits have to wait for the VC toward node 0.
Even when the buffer size is infinite, the throughput of VOQ is still low,
because the waiting time at the node-0 VC is very long.

□ Phoenix

* Degradation of map time
I think the reason for the degradation is the disk I/O time.
For a shared-memory system, the cache miss rate can increase as the number of processors increases,
and as the input size gets bigger, the degradation can become larger.
Actually, the L1 cache miss rate increases beyond 8 threads.

,