Massively Parallel Programming with Parallaxis ---------------------------------------------- Thomas Braunl Universitaet Stuttgart, IPVR Breitwiesenstr. 20-22, D-7000 Stuttgart 80, FRG ### This paper only partially covers version 1 of the Parallaxis system, ### no version 2 extensions are being discussed! Abstract In the Parallaxis programming language, the model of a parallel architecture is included as an integral part of the problem solution. That is, the combination of algorithm and machine model accounts for an entire specification. Since the choice of the computer architecture largely determines the structure of the algorithm applied, a structured parallel programming language should be given the expressive power of stating, or even better, selecting the structure of the parallel architecture, for which an algorithm is bound. Our language model Parallaxis operates on the class of SIMD array processors. In addition to the concurrent algorithm, Parallaxis allows the specification of arbitrary network topologies by means of a functional description with configuration and connection specifications. Simple concepts for concurrent execution of statements and message passing are based on these definitions. Variable declaration takes the parallel machine model into account and therefore splits into variables for the control unit (scalars) and variables for each of the parallel processing units (vectors). 1. Introduction The Parallaxis language model was designed to allow structured parallel programming in a high level language, similar to sequential Pascal or Modula-2. The compiler should be ma- chine-independent, so it can be used for a wide range of parallel architectures. In addition, it should be useful in exploiting the parallel resources of any particular architecture satisfying the basic machine model, in order to achieve optimal performance. Both goals (which might look contradictory, at first) are achieved by a translation into a low level intermediate parallel lan- guage which is processed by a machine-dependent interpreter or compiler. This flexible model is restricted to SIMD structures; it combines the internal representation of the hardware struc- ture with the topology- or structure-specific algorithm to form a complete problem solution. Thus, it is possible to specify number and arrangement of processing elements, as well as their communication network in Parallaxis. Here, we get a new meaning of the term structured. The whole problem solution is structured, including the algorithm, as well as the architecture for which the algorithm is bound. Given this semantics, it is necessary to include a machine description, if the language should not be re- stricted to a single architecture that is implicitly assumed. An algorithm for the same problem will look totally different if designed for a hypercube instead of a ring-topology. Besides, for a certain problem there may be structures well suited and others that do not match very well. If there is a choice (as for reconfigurable parallel systems), the architecture specification can be used for creating the best matching architecture. In this paper, I will address the language concepts supporting the Parallaxis user in pro- gramming a SIMD computer on a high level of abstraction. I will not be concerned at this point, however, what strategy or algorithm to apply in order to perform an efficient mapping between a topology specified for a certain application and the topology provided by the physi- cal hardware structure. This problem is handled automatically on a lower operating system level, transparent to the user. However, it will greatly affect the efficiency of an application with heavy communication. H. J. Siegel developed efficient mapping algorithms for simulating one topology on a different topology. Corresponding to the machine model of a processor-network, controlled by a single control unit, variables may be declared either for the control unit (using the declaration keyword scalar), or for each of the PEs (using vector). Each variable is strictly typed, like in Modula-2, so there may be no vector variables in an expression that is to be assigned to a scalar variable. Vectors may be used inside explicitly marked parallel blocks or operations only while scalars may also appear in parallel vector expressions (requiring a duplication or broadcast of that par- ticular value). 2. Specification of Network Topologies Parallaxis makes the following assumptions about its abstract parallel machine, whether it is physically present or simulated: * SIMD structure with - central control unit - variable number of processing elements (PEs) - flexible (reroutable) communication network that is, it can be used as if forming any topological structure * all operations occur synchronously * all PEs are identical in processor and memory structure * each PE owns the same number of bi-directional ports for sending and receiving messages * one process per PE Within the realm of this machine model, the programmer may now choose a certain topological structure for the communication network to achieve the best possible match between ar- chitecture and algorithm of a given problem. For each application, the number of PEs and the network topology is static, that is, it has to be specified prior to compilation and will remain unchanged during program execution. The specification of the network structure's logical form takes two steps: First, Parallaxis is told the number of processing elements one wants to use and how they will be arranged in di- mensions. This almost exactly takes the form of Modula-2's array declaration, except that we are dealing with processors here instead of data elements. Although we create some neighbor relation between the PEs by this declaration, this does not specify any connections between processors which is reserved for the second step. There, we may specify a transfer-function (from a general PE position to its relative neighbor) for every processor-port in this topology. Or, stated differently, the number of transfer-functions provided equals the number of ports per PE. Each transfer-function has a name (the exit-port's name) and also states the name of the corresponding entry port on its neighbor-PE after a period. Parallaxis allows the specification of arbitrary network topologies. The examples shown in the following sections should be guidelines for modeling other topologies. 2.1 Ring Structure The simple topological structure of a ring may be specified as follows in Parallaxis: CONFIGURATION ring [12]; CONNECTION c_wise : ring[i] -> ring[(i+1) mod 12].cc_wise; cc_wise: ring[i] -> ring[(i-1) mod 12].c_wise; 11 --- 0 --- 1 | | 10 2 | | 9 3 <-- PE --> | | cc_wise c_wise 8 4 | | 7 --- 6 --- 5 Figure 1: ring topology With the configuration specification the user tells the system to reserve twelve processors, la- beled zero to eleven, that are arranged in a single dimension. Each PE owns two ports, called c_wise (for clockwise mapping) and cc_wise (for counter-clockwise mapping), corresponding to these two transfer-functions. While the function c_wise maps any processing element to the next higher PE, the function cc_wise maps any element to the next lower one. Using the mod- ulo-operator results in a closed topology that is, every element has a neighbor at every port. 2.2 Two-Dimensional Grids Now, let us take a look at some simple two-dimensional network structures: the grid, an open topology that might be useful in application areas ranging from image processing to air flow analysis for wing profile design, and its closed counterpart the torus (see figure 2). ^ ^ ^ ^ ^ | | | | | grid X -- X -- X -- X -- X torus <- X -- X -- X -- X -- X -> | | | | | (wrapped | | | | | X -- X -- X -- X -- X around) <- X -- X -- X -- X -- X -> | | | | | | | | | | X -- X -- X -- X -- X <- X -- X -- X -- X -- X -> | | | | | | | | | | X -- X -- X -- X -- X <- X -- X -- X -- X -- X -> | | | | | v v v v v Figure 2: two-dimensional grid and torus Their specification in Parallaxis is straightforward, as shown in figure 3. It takes just four transfer-functions to specify the whole grid with each PE owning four data-ports (here called: north, south, east, and west). The torus specification is almost identical; the additional modulo- operator connects opposite boundary-PEs. When considering the grid specification, there are PEs whose neighbor function evaluates to an invalid PE-number: e.g. the PE at position [2,4] would have a right neighbor [2,5], but this is out of bounds as of the grid size fixed in the CONNECTION specification (rows labeled 0 to 3, and columns labeled 0 to 4). By definition, these PEs (called "boundary-PEs") simply do not have a connection in that particular direction. Any data exchange to or from the outside will have no effect (see section 4 for more detail). CONFIGURATION grid [4] [5]; CONNECTION north: grid[i,j] -> grid[i+1, j].south; south: grid[i,j] -> grid[i-1, j].north; east : grid[i,j] -> grid[i, j+1].west; west : grid[i,j] -> grid[i, j-1].east; CONFIGURATION torus [4] [5]; CONNECTION north: torus[i,j] -> torus[(i+1) mod 4, j].south; south: torus[i,j] -> torus[(i-1) mod 4, j].north; east : torus[i,j] -> torus[i, (j+1) mod 5].west; west : torus[i,j] -> torus[i, (j-1) mod 5].east; Figure 3: grid and torus specification 3. Variable Declaration As seen before, we have to deal with a basically two-fold system structure: * a controlling host and * a network of processing elements. Variables that are used for controlling the execution sequence, such as a counting variable in a for-loop, should exist only once at the host computer while a variable used for vector compu- tations might be required once for each PE. To distinguish these two types of variable declara- tions, Parallaxis offers the phrases scalar for declaring a variable at the host, and vector for declaring a variable to appear in every PE's memory structure, thus overall creating a vector (or some higher order data structure, like a matrix, etc., depending on the specified network topo- logy). An example follows: SCALAR i,j: integer; VECTOR a,b: real; c,d: integer; The distinction between scalar and vector variables is also necessary for formal procedure pa- rameters and local variable declarations. 4. Parallel Programming Concepts Parallaxis provides language primitives to account for the parallel machine model which build on the topological structure specified by means of configuration and connection. There are three basic concepts for parallel processing in Parallaxis: 1. parallel execution ("parallel block") 2. parallel data exchange ("propagate operation") 3. vector reduction ("reduce function") 4.1 Parallel Execution In analogy to the begin-end block used to group statements to be executed sequentially, there is the parallel-endparallel block for synchronous parallel statements. The semantics hereby is that every processing element executes the same statement with its own local values of the variables involved, thus obtaining individual results. So the phrase parallel mustn't lead to an erroneous interpretation: all statements contained in this block are executed sequentially under central control, but they are performed data-parallel. Together with a unique control flow this just re- flects the single instructionJP multiple data (SIMD) machine model. Because of the SIMD restriction, each PE may execute the current instruction or remain idle; a concurrent execution of different instructions is not possible. There are two ways of selecting PEs for parallel execution in Parallaxis which may also be combined: A) by explicitly stating their network position at the entrance of a parallel block for each dimension specified e.g., for a one-dimensional structure: PARALLEL [22..44] ENDPARALLEL B) by using if-, case-selections or while-, repeat-loops with vector conditions e.g.: PARALLEL IF THEN END ENDPARALLEL In case A, only the PEs within the selected range execute the statements inside the parallel block, all others remain idle during that time. PE selections may be constant or variable posi- tional expressions, such as subrange, enumeration, and set. One selection is required for every dimension. In case B, a conditional expression determines which PEs will execute the statements of the then-branch and which will remain idle. While the branching-condition evaluates to true for one processor, it might be false for others. The condition is evaluated for each PE individually in parallel and only those PEs for which the condition evaluates to true will execute the then- branch; all other PEs remain idle. If there exists an else-branch, it will be executed subse- quently with the inverse PE group. The two branches of an if-selection cannot be executed in parallel because of the previously mentioned "single control flow" SIMD restriction. Therefore, they have to be serialized. Processors that execute the then-part may continue while processors executing the else-part are blocked with their identifications pushed onto a global stack at the controlling host. They remain inactive while the host supervises execution of the then-part. Afterwards, when executing the else-part, the processor sets change places, that is, the "then- processors" now become inactive for some time while the "else-processors" are active. The use of a dynamic stack also accounts for nested if-statements. Each stack-entry corresponds to a nesting-level of if-statements. Since serialization degrades the performance of a parallel system, the user should keep this fact in mind when designing a parallel algorithm. The semantics of a parallel loop is analogous: only those PEs satisfying the loop-condition execute the loop-statements. A loop can only be terminated when none of the active PEs satis- fy the loop-condition. As long as a single PE remains, the loop is being continued while all PEs excluded by the loop-condition are idle. This means, the controlling host has to get a feed- back from the network whether there are PEs remaining or not. The implementation of this feedback depends on the hardware facilities of the target system. In any case, the OR-reduction of the condition-vector (see also section 4.3) returns this information to the host. 4.2 Parallel Data Exchange The propagate-operation accounts for data exchange between the parallel processors. Propagate behaves like a compound statement of send followed immediately by receive. All selected (or active) PEs participate in this parallel data exchange. In its simple form, the propagate-opera- tion reads: propagate. ( ) e.g., for the ring structure of section 2.1: propagate.c_wise (x) The semantics of this statement is that the value of a vector (here: "x") is propagated through the network by one step in the stated direction (here: "c_wise" for clockwise). The direction names directly refer to the CONNECTION specification of the network where they were defined as exit-ports. For simple network structures like ring, grid, hypercube, and so on, the corre- sponding entry-port is unambiguous. Therefore, it can be determined automatically by Par- allaxis (in our ring example: "cc_wise"). For complex topologies (as the tree structure, see section 6) both send-port and receive-port have to be supplied, in order to perform the propa- gate-operation. Let us now assume that the PEs labeled 3 to 8 of the twelve ring PEs have been selected (activated) to perform the above propagate-operation. Each PE sends a message (here: the local value of variable "x") to its clockwise neighbor and then reads a message from its counter- clockwise neighbor, since the port cc_wise was defined to be the entry for c_wise in the con- nection specification. Disregarding those basic sends and receives, we recognize that informa- tion has been propagated one shift in clockwise direction all around the selected sector, thus reflecting the operation's name. Though the ring represents a closed topology, the sector se- lected here is equivalent to an open topology. Interesting is the behavior at the boundary, which in our example are PEs no. 3 and 8. PE no. 3 does not have an active neighbor for receiving (direction "cc_wise") while PE no. 8 does not have a neighbor for sending (direction "c_wise"). As of the propagate-operation, this is equivalent to not having a neighbor at all for that direction in an open topology. PE no. 3 first sends its message along to PE no. 4. Since it cannot perform a subsequent receive, its local value of "x" remains unchanged. PE no. 8 's send goes without any effect (its neighbor is inactive) while the receive operation can be performed normally. Other syntactical form of the propagate-operation allow: * the separation between send-expression and receive-variable (in case one does not want to overwrite the variable as in the previous form) e.g.: propagate.c_wise (x+1,y) * the propagation of several steps along a fixed direction e.g.: propagate.c_wise ^5 (x) 4.3 Vector Reduction Sometimes not the whole vector is of interest, but only a scalar reduction of it like its sum. Without a specialized operation, this scalar information is awkward to get. Parallaxis provides load and download operations to move a scalar array at the host to and from a vector distributed over the network. So, in order to get the sum of a vector, one has to download the complete vector to the host and then add it up iteratively. The reduce-function enables a vector reduction in a single statement and without involving additional variables. The desired reduction opera- tion has to be specified after a period. Predefined are sum, product, and, or, min, max, first, and last; user defined functions with appropriate parameters may be used as well. The reduce-operation is completely independent of the specified topology; it is a primitive, one- dimensional vector operation. Specialized components of a parallel system may execute this command in time O(log2 n) in a tree-like operation mode. Assume "s" being a real scalar and "x" being a real vector: s := REDUCE.sum (x) This statement assigns the component-wise sum of vector "x" to scalar "s". 5. Sample Programs Now, we are going to look at a couple of Parallaxis programs. The first shows a parallel sorting algorithm, called "odd-even transposition sorting", the second is a parallel version of the "Sieve of Eratosthenes" for generating prime numbers. "Odd-Even Transposition Sorting" (OETS) is a parallel sorting algorithm that is able to sort n numbers on n PEs in time O(n). The PEs are connected in a bi-directional, open linear list; I/O instructions have been omitted for clarity. In odd iteration steps, the PE-pairs 1-2, 3-4, and so on are compared in parallel while in even iteration steps the pairs 2-3, 4-5, and so on are han- dled. SYSTEM sorting; CONST n = 1000; CONFIGURATION list [1..n]; CONNECTION left : list[i] -> list[i-1].right; right: list[i] -> list[i+1].left; SCALAR k : integer; VECTOR val,r,l : integer; swap : boolean; BEGIN ... (* read input data *) FOR k:=1 TO n DO PARALLEL PROPAGATE.right(val,l); PROPAGATE.left (val,r); (* l/r now hold the left/right neighbors' values *) swap := false; IF odd(k) THEN (* compare 1-2, 3-4, ... *) IF odd(dim1) AND (r < val) THEN val := r; swap := true END ELSE (* even (k) compare 2-3, 4-5, ... *) IF even(dim1) AND (r < val) THEN val := r; swap := true END; END; PROPAGATE.right(swap); IF swap AND (id_no>1) THEN val := l END; ENDPARALLEL END; ... (* write output data *) END sorting. Each PE holds one component of the vector "val" that is to be sorted, as well as local copies of each one's left and right neighbor. The marker variable "swap" is used for the bookkeeping of swap operations to be finished at the right neighbor PEs. A different approach without marker propagation is possible, but complicates the program. The program for generating prime numbers is very much straight-forward and does not need a lot of explanation. In each iteration through the "while"- loop, all multiples of the current number are eliminated in parallel. Execution of the loop continues until no "candidate" (on any PE) is left. SYSTEM sieve; CONFIGURATION list [1000]; CONNECTION (* none *); SCALAR prime: integer; VECTOR candidate: boolean; BEGIN PARALLEL candidate := id_no >=2; WHILE candidate DO prime:= REDUCE.First(id_no); WriteInt(prime,10); WriteLn; IF id_no MOD prime = 0 THEN candidate:=FALSE END END ENDPARALLEL END sieve. 6. Parallaxis Syntax Parallaxis Language Definition (c) Thomas Braunl, Universitaet Stuttgart, 1989 System = SYSTEM sys_ident ";" { ConstantDecl | TypeDecl } HardwareDecl SoftwareDecl sys_ident "." . ConstantDecl = CONST { ident "=" ConstExpr ";" } . TypeDecl = TYPE { ident "=" type ";" } . HardwareDecl = CONFIGURATION conf_ident IntRange { "," IntRange } ";" CONNECTION [ TransferFunc { ";" TransferFunc } ] ";" . IntRange = "[" range "]" . range = int_ConstExpr [ ".." int_ConstExpr ] . TransferFunc = out_direction ":" conf_ident "[" source { "," source } "]" ( "->" | "<->" ) destination { "," destination } . direction = ident [ "(" (integer | const_ident) ")" ] . source = ident | integer. destination = [ discriminant ] conf_ident "[" ExprList "]" "." in_direction. discriminant = "{" bool_expr "}". SoftwareDecl = VariableDecl { ProcedureDecl ";" } block . VariableDecl = [ ControlVarDecl ] [ LocalVarDecl ] . ControlVarDecl = SCALAR { ident { "," ident } ":" type ";" } . LocalVarDecl = VECTOR { ident { "," ident } ":" type ";" } . ProcedureDecl = PROCEDURE proc_ident [FormalParams] ";" { ConstantDecl | TypeDecl } SoftwareDecl proc_ident . FormalParams = "(" [ parameters { ";" parameters } ] ")" [ ":" ( SCALAR | VECTOR ) function_type ] . parameters = SCALAR [ VAR ] ident { "," ident } ":" type | VECTOR [ VAR ] ident { "," ident } ":" type . block = BEGIN StatementSeq END . StatementSeq = statement { ";" statement } . statement = [ assignment | ProcedureCall | IfSelection | CaseSelection | WhileLoop | RepeatLoop | LoopStatement | ForLoop | WithStatement | EXIT | RETURN [ expr ] | ParallelExec | Propagate | Load | Store ] . ParallelExec = PARALLEL selection StatementSeq ENDPARALLEL . selection = [ "[" entry "]" { "," "[" entry "]" } ] . entry = range { "," range } | expr [ ".." expr ] | "*" . Propagate = PROPAGATE "." out_dirvar [ "^" ( integer | ident) ] [ "." in_dirvar ] "(" vector_designator [ "," vector_designator ] ")" . dirvar = ident [ "(" expr ")" ] . Load = LOAD selection "(" vector_designator "," scalar_designator [ "," length_designator ] ")" . Store = STORE selection "(" vector_designator "," scalar_designator [ "," length_designator ] ")" . Reduce = REDUCE "." operator_ident selection "(" expr ")" . assignment = designator ":=" expr . designator = ident { "[" ExprList "]" | "." ident } . ProcedureCall = proc_ident [ "(" ExprList ")" ] . FunctionCall = proc_ident "(" [ ExprList ] ")" . IfSelection = IF bool_expr THEN StatementSeq { ELSIF bool_expr THEN StatementSeq } [ ELSE StatementSeq ] END . CaseSelection = CASE expr OF case { "|" case } [ ELSE StatementSeq ] END. case = CaseLabels { "," CaseLabels } ":" StatementSeq . CaseLabels = ConstExpr [ ".." ConstExpr ] . WhileLoop = WHILE bool_expr DO StatementSeq END . RepeatLoop = REPEAT StatementSeq UNTIL bool_expr . LoopStatement = LOOP StatementSeq END . ForLoop = FOR ident ":=" expr TO expr [ BY ConstExpr ] DO StatementSeq END . WithStatement = WITH designator DO StatementSeq END . type = SimpleType | ArrayType| RecordType | SetType . SimpleType = type_ident | enumeration | subrange . enumeration = "(" const_ident { "," const_ident } ")" . subrange = "[" ConstExpr ".." ConstExpr "]" . ArrayType = ARRAY SimpleType { "," SimpleType } OF type . RecordType = RECORD FieldListSeq END . FieldListSeq = FieldList { ";" FieldList } . FieldList = [ ident { "," ident } ":" type ] . (* variant records not yet supported *) SetType = SET OF SimpleType . ExprList = expr { "," expr } . expr = SimpleExpr { relation SimpleExpr } . relation = "=" | "<>" | "#" | "<" | ">" | "<=" | ">=" | IN . SimpleExpr = [ "+" | "-" ] term { AddOperator term } . AddOperator = "+" | "-" | OR . term = power { MulOperator power} . MulOperator = "*" | "/" | DIV | MOD | AND | "&" . power = factor { "^" factor } . factor = FunctionCall | Reduce | string | set | number | designator | structure | "(" expr ")" | NOT factor . string = "'" { character } "'" | '"' { character } '"' . set = [type_ident] "{" [ element { "," element } ] "}" . element = ConstExpr [ ".." ConstExpr ] . structure = record_ident "(" ExprList ")" . CExprList = [ ConstExpr { "," ConstExpr } ] . ConstExpr = SimpleConstExpr { relation SimpleConstExpr } . SimpleConstExpr = [ "+" | "-" ] ConstTerm { AddOperator ConstTerm } . ConstTerm = ConstPower { MulOperator ConstPower } . ConstPower = ConstFactor { "^" ConstFactor } . ConstFactor = const_ident | number | string | set | recarr_ident "(" CExprList ")" | stdfct_ident "(" CExprList ")" | "(" ConstExpr ")" | NOT ConstFactor . number = integer | real . integer = digit {digit} . real = digit {digit} "." {digit} ["E" ["+" | "-"] digit {digit}] . ident = letter { letter | digit } . character = letter | digit | "$" | .. (* any character, e.g. ASCII *) . digit = "0" | . . | "9" . letter = "A" | . . | "Z" | "a" | . . | "z" | "_" .