Non-speculative enhancements for the scheduling logic

  1. GRAN TEJERO, RUBEN
Dirigida por:
  1. José María Llaberia Griño Director/a
  2. Enrique Morancho Llena Codirector/a
  3. Angel Olive Duran Codirector/a

Universidad de defensa: Universitat Politècnica de Catalunya (UPC)

Fecha de defensa: 05 de noviembre de 2010

Tribunal:
  1. Juan José Navarro Guerrero Presidente/a
  2. Ramon Canal Corretger Secretario/a
  3. Víctor Viñals Yufera Vocal
  4. Pablo Enrique Ibáñez Marín Vocal
  5. Francisco Tirado Fernández Vocal

Tipo: Tesis

Teseo: 111476 DIALNET

Resumen

This thesis focuses on a tradeoff between desirable objectives in High-Performance Out-of-Order processors. In particular, we refer to the difficulty that exists between operating at a high frequency and the ability to expose and exploit instruction level parallelism of the executed code. In Out-of-Order processors, the issue queue and the scheduler are related to exposure and exploitation of instruction level parallelism. Two design parameters of these two elements are the number of entries (instructions) of issue queue and the issue-width. A high-performance processor requires that the values of these parameters to be as high as possible. However, the cost (area and delay) of the issue queue and the scheduler also grow with these parameters, thereby increasing the value of these parameters may compromise the processor cycle time. In this thesis we employ two different techniques that can alleviate the conflict between frequency and cost (area and delay) of the scheduling stage. On the one hand, pipelining the scheduling stage. On the other hand, slicing the selection logic. The aim of the first technique is to increase the number of entries in the instruction queue without increasing cycle time. The aim of the second technique is to remove the implicit serialization of arbiters in the selection logic due to issue-ports can handle the same instruction type. Also, this second technique reduces area and delay of arbiters. However, these techniques involve some performance losses when applied in a straight-forward manner. A pipelined scheduling logic distributes its delay into several stages which may reduce pressure on processor cycle time. However, some instructions cannot be executed back-to-back with their producer instructions which may suppose a performance loss. This thesis proposes the idea of Dependence Level Scheduler (DLS), which is able to tolerate the latency in the wakeup-select hardware-loop. The proposal makes use of the observation that usually the selection logic schedules all competing for selection instructions in a single cycle. In DLS, producer instructions (shorter execution-latency than the hardware-loop latency) can wakeup in advance to their dependent instructions, this is, once they begin to compete for selection. This manner, the selection phase of producer instructions and the wakeup phase (in advance) of their consumer instructions are overlapped. Since the observation, in which DLS relies on, is not met always, it is necessary to manage the situation in which there are more producer instructions than the ones that can be scheduled in a single cycle. The second part of this thesis focuses on the selection logic and its "sequential nature". When an instruction type can be issued through several issue-ports, the arbiters in charge of these issue-ports must operate in serie. In contrary, all these arbiters may pick up the same instruction. One option consists in slicing the issue queue into ranges of consecutive entries. Issue-queue entries in the same range only compete for the same issue-port in the same arbiter. We call these ranges, arbitration domains. Thanks to slicing the selection logic into arbitration domains, arbiters of the selection logic can operate in parallel and also, cost of each arbiter (reduction in area and delay) is smaller than in a conventional sequential selection logic (Ideal). On the other hand, this design has a worse performance than an Ideal selection logic due to several reasons. The proposal of this thesis looks for obtaining the ideal performance of a sequential model, and the costs (area and delay) of a model sliced into arbitration domains. This proposal uses the surplus of ready instructions in some arbitration domains with those arbitration domains that are lack of ready instructions. In order to do this, each arbitration domain attempts to pre-select several instructions. When an arbitration domain has no ready instruction, arbitration domains with a surplus of instructions can also issue one instruction through that issue port. Following this idea, we propose four design alternatives that manage to improve the performance of the former selection logic sliced into arbitration domains. In addition, with respect to the sliced selection-logic, our proposals shortly increase its costs (area and delay) in comparison with the additional costs of an ideal selection logic fully sequential.