Software Partitioning Method Trading off Load Balance and Communication Optimization

Kai Huang, Siwen Xiu, Min Yu, Xiaomeng Zhang, Rongjie Yan, Xiaolang Yan, and Zhili Liu

To achieve high performance on MPSoC via parallelism,
a key issue is how to partition a given application into
different components and map them onto multiple
processors. In this paper, we propose a software pipeline-based partitioning method with cyclic dependent task
management and communication optimization. During
task partition, considering computation load balance and
communication overhead optimization at the same time
can cause interference, which leads to performance loss.
To address this issue, we formulate their constraints and
apply the Integer Linear Programming (ILP) approach to
find an optimal partitioning result trading off these two
factors. Experimental results on a reconfigurable MPSoC
platform demonstrate the effectiveness of the proposed
method, with 20% to 40% performance improvements compared to the traditional software pipeline partitioning method.


Keywords: software pipeline, partition, cyclic dependent
task management, communication optimization

Manuscript received Apr. 22, 2014; revised Dec. 24, 2014; accepted Jan. 2, 2015.
Kai Huang ( XXXXXXXXXX), Siwen Xiu (corresponding author, XXXXXXXXXX), Min Yu ( XXXXXXXXXX), Xiaomeng Zhang ( XXXXXXXXXX), and Xiaolang Yan ( XXXXXXXXXX) are with the Department of Information Science & Electronic Engineering, Zhejiang University, Zhejiang, China.
Rongjie Yan ( XXXXXXXXXX) is with the Department of Information Science & Electronic Engineering, Zhejiang University, Zhejiang, China.
Zhili Liu ( XXXXXXXXXX) is with the marketing team of Hangzhou C-Sky Microsystem Co., Ltd., Zhejiang, China.
The article has been accepted for inclusion in a future issue of ETRI Journal, but has not been fully edited. Content may change prior to final publication. http://dx.doi.org/10.4218/etrij XXXXXXXXXX. RP1404-0502e © 2015 ETRI
I. Introduction
The increasing demands for high performance of embedded
systems promote the extensive use of Multiprocessor System-
on-Chip (MPSoC). Given an application, one key issue of
generating efficient parallel codes for a target MPSoC platform
is how to partition it into different components and map them
onto different processors with the best performance. As a
prevalent parallelization method, software pipeline is an
effective solution to address this problem. For software
programs, pipelining introduces higher degree of parallelism to
increase the program throughput. For hardware processors, the
pipelined stages make it easy to partition and map the
decomposed programs onto different components to achieve better hardware utilization [1].
However, the increasing complexity of applications and
hardware architecture challenges the efficiency of software
pipeline. To explore the parallelism of software application and
hardware architecture, the software pipeline technique has to
face the following two issues:
• How to keep balanced workloads as well as maintain task
dependency. High parallelism calls for a balanced pipeline
where each stage has almost the same execution time as well as
linear stage dependency. However, most of the existing
applications involving complicated cyclic task dependencies
may constrain the distribution of tasks among processors,
which makes it harder to keep balanced workloads among
pipelined stages without destroying task dependencies.
• How to minimize communication overheads. With the increasing complexity of MPSoC, inter-stage communication is becoming a non-negligible factor for software pipeline.
Decomposing a task into finer-grained subtasks results in
higher overhead in synchronizing subtasks, with lower system performance and scalability [2]. Thus, how to reduce the communication overhead between software pipeline stages should also be emphasized.
In some cases, the two issues may interfere with each other, making software pipeline construction harder.
Communication Pipeline [3] is a communication optimization
technique that can significantly hide communication transfer
time between processors. However, its additional latency may affect the handling of cyclic dependent tasks and cause non-adjustable workload imbalance. Therefore, we have to
maintain a trade-off between workload balance and
communication optimization techniques for better parallelism.
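To make this trade-off concrete, the following toy sketch (not from the paper; all task times, communication costs, and partitions below are hypothetical) models a two-stage software pipeline whose steady-state period is bounded by its slowest stage. With communication pipelining, a stage's transfer time overlaps its computation; without it, the transfer time adds to the stage's period:

```python
# Hypothetical task execution times in milliseconds.
task_time = {"A": 4, "B": 3, "C": 5, "D": 2, "E": 6, "F": 4}

def throughput_period(stages, comm_ms, pipelined):
    """Steady-state initiation interval = the slowest stage's period.
    With communication pipelining the transfer overlaps computation,
    so each stage is bounded by max(compute, comm); without it,
    the transfer time adds to the stage's compute time."""
    periods = []
    for tasks in stages:
        compute = sum(task_time[t] for t in tasks)
        periods.append(max(compute, comm_ms) if pipelined else compute + comm_ms)
    return max(periods)

# A balanced partition with heavier inter-stage traffic (comm hidden by
# pipelining) versus an imbalanced one with lighter, unpipelined traffic.
balanced = [["A", "C", "D"], ["B", "E", "F"]]       # 11 ms vs 13 ms compute
low_comm = [["A", "B"], ["C", "D", "E", "F"]]        # 7 ms vs 17 ms compute

print(throughput_period(balanced, comm_ms=5, pipelined=True))   # -> 13
print(throughput_period(low_comm, comm_ms=2, pipelined=False))  # -> 19
```

Here the balanced partition wins despite higher communication cost, but if communication pipelining's extra latency conflicted with a cyclic dependency, the imbalanced partition could become the only feasible one, which is exactly the tension the proposed method resolves.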
In this paper, we propose a software pipeline-based
partitioning method with cyclic dependent task management
and communication cost minimization. The interference between communication optimizations and workload balance is well addressed for better performance. We first analyze how
to partition general pipeline stages in cyclic dependency
topology. Next, we quantify the inter-stage communication
pipeline optimization on software pipeline partition, and then
formulate these constraints in our Integer Linear Programming
(ILP) models to trade off software pipeline for a better
partitioning result. Finally, each pipeline stage is mapped to one
processor.
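As a sketch of the general shape such an ILP model can take (an illustrative toy formulation, not the paper's actual model, which is developed in Section IV): let binary variable x_{t,s} indicate that task t is assigned to pipeline stage s, let e_t be the execution time of task t, and let E be the set of task dependency edges. Minimizing the load of the slowest stage while preserving stage ordering can be written as:

```latex
\begin{aligned}
\min\ & T \\
\text{s.t.}\ & \sum_{s} x_{t,s} = 1 && \forall t, \\
& \sum_{t} e_t\, x_{t,s} \le T && \forall s, \\
& \sum_{s} s\, x_{v,s} \ge \sum_{s} s\, x_{u,s} && \forall (u,v) \in E, \\
& x_{t,s} \in \{0,1\}.
\end{aligned}
```

The paper's formulation additionally quantifies inter-stage communication cost and the latency introduced by communication pipelining; the sketch only shows the structure common to such models: a min-max load objective with assignment and dependency-ordering constraints.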
The main contributions of this paper are summarized as
follows: First, the proposed method combines both software
pipeline and communication pipeline techniques to balance
computation load and reduce communication overhead. For
the first time, the cyclic constraint for general software pipeline
technique is investigated and two kinds of pipelines are
combined and executed well. Second, the software pipeline-based partitioning and mapping method is integrated into a
Simulink-based MPSoC multithreaded code generation flow,
which implements the automatic generation of efficient parallel
code from sequential applications.
The rest of the paper is organized as follows: Section II
gives some related works. Section III describes the background
of the Simulink model, software pipeline and communication
pipeline. Section IV introduces the proposed mapping method.
Section V shows the feasibility of the implementation of our
method. Section VI demonstrates the experiments and
discusses the results. Section VII concludes this paper and
highlights the directions for future work.
II. Related Work
Current literature offers plenty of methods for code
generation from high-level models. Most methods are based on
functional modeling such as Kahn Process Network (KPN) [4],
dataflow [5], UML [6] and Simulink [7]. As a prevalent
environment for modeling and simulating complex systems at
an algorithmic level of abstraction, Simulink has been widely
used, such as in Real-Time Workshop (RTW) [8], dSpace [9]
and many other code generators [10]-[11]. LESCEA (Light
and Efficient Simulink Compiler for Embedded Application)
[12] is an automatic code generation tool with memory-
oriented optimization techniques. Nevertheless, the partitioning
and mapping of an application in LESCEA is conducted
manually, which requires expertise and significantly affects the
performance of the generated codes.
The high performance requirements of embedded
applications necessitate the need to realize efficient partitioning
and mapping methods. Much literature can be found to tackle
the problem. For example, search based approaches are
extensively used, such as Simulated Annealing (SA) in [13],
ILP in [14], which can achieve optimal or near-optimal
solutions. Further, performance metrics such as
communication latency, memory, energy consumption and so
on are optimized along with the mapping methods (please refer
to [15] for more details).
As one of the parallelization methods, software pipeline is
widely studied. Cyclic task dependency is an important factor
that limits the performance of a software pipeline. In [16]-[18], all three approaches exploit the retiming technique to
transform intra-iteration task dependency into inter-iteration
task dependency to implement a task-level coarse-grained
software pipeline. However, communication is not fully
considered in these works. In [19], the authors construct a
software pipeline for streaming applications where
communication is optimized through laying buffers in
communication channels. As a result, sending and receiving
can be operated independently to avoid synchronization
overhead, which is similar to our work. In [20], the partitioned
streaming application is assigned to pipeline stages in such a
way that all communication (DMA) is maximally overlapped
with computation on the cores. Nevertheless, the assumption
that the whole streaming application model has no feedback
loops limits the utilization of the software pipeline in real-life
applications.
ILP is a well-known approach owing to its ability to compute optimal results for partitioning problems. It is also applied to
generate software pipelines. ILP is exploited in [20] to
determine the assignment of synchronous dataflow actors to
pipeline stages corresponding to processors to minimize the
maximal load of any processor. In [21], an ILP formulation is
utilized to search a smaller design space and find an
appropriate configuration for ASIPs, with minimized system
area while satisfying system runtime constraints in pipelined
processors. An ILP based mapping approach is presented in
The article has been accepted for inclusion in a future issue of ETRI Journal, but has not been fully edited. Content may change prior to final publication.
http:
dx.doi.org/10.4218/etrij XXXXXXXXXX
RP1404-0502e ยฉ 2015 ETRI 2
[22] to minimize the most expensive path in a pipeline under
the constraint of program dependency and the maximal
number of concu
ently executed components. These methods
also have less consideration on any or both of the discussed
two factors in software pipeline.
Previous works have implemented software pipeline in
various ways and integrated certain optimizations on cyclic
task dependencies or communications respectively. In this
paper, we consider both cyclic task dependency and
communication overheads when trading off computation and
communication, and integrate the techniques handling the two
problems into our later software pipeline partitions. We utilize
ILP formulations to quantify and combine the above two
factors in order to obtain higher performance.
III. Background
1. Simulink model
This work is based on the concepts of Simulink models,
which have been introduced in previous works [12], [23]-[24]. A Simulink model represents the functionality of the
target system with software function and hardware architecture.
It has the following three types of basic components.
• Simulink Block represents a function that takes n inputs and produces certain outputs. Examples include user-defined (S-function), discrete delay, and pre-defined blocks such as
mathematical operations. For the ease of discussion, we mainly