# 4. Basic Digital Circuits

Digital circuits are compositions of logic gates. A small number of
digital circuits occur frequently in larger digital designs, such as
multiplexers, encoders, decoders, and memory elements. In this
chapter, we use the *method of logical effort* to
study those basic digital circuits that serve as building blocks for
the construction of larger digital systems.

## 4.1. Logic Gates

In Section *Logic Gates*, we introduce logic gates
with one and two inputs. Often, we need gates with more than two
inputs, or wish to design new logic gates for specific logic functions
or timing behavior. In the following we study the design and
characterization of logic gates as elementary building blocks for
digital circuits.

### 4.1.1. Logic Gates with Multiple Inputs

Assume we design a digital circuit and need a NAND gate with 3 inputs. We may assemble the 3-input NAND gate using 2-input NAND gates and an inverter as building blocks, see Figure 4.1. Using Boolean algebra, it is straightforward to show that this circuit implements the logic function \(Y = \overline{A\,B\,C}.\) There are several problems with this implementation, though. First, the delay from inputs \(A\) and \(B\) to \(Y\) is larger than from input \(C.\) Such asymmetries can be confusing when designing larger circuits with delay constraints. Second, the delay of the longest path is larger than necessary, and, third, the CMOS implementation needs 10 transistors, which is also more than necessary.

An alternative design for the 3-input NAND gate uses CMOS transistors as building blocks, as shown in Figure 4.2. This circuit needs only 6 transistors, and is symmetric w.r.t. its inputs. Output \(Y\) is 0 only if all inputs are 1.

Figure 4.2: CMOS circuit of 3-input NAND gate and interactive switch model.

The circuit topology in Figure 4.2 extends to
\(n\)-input NAND gates for \(n \ge 2\): compose \(n\) pMOS
transistors in parallel and \(n\) nMOS transistors in series.[1] The series composition of the nMOS transistors determines
the on-resistance of the pull-down path. The larger the number of inputs,
the smaller the pull-down current and, hence, the larger the delay of
the gate. The *model of logical effort* reflects
this dependence of delay on the number of inputs in the logical effort
of the gate. The matched \(n\)-input NAND gate increases the
width of the nMOS transistors to match the drive current of the
reference inverter. Increasing the transistor widths reduces the
on-resistance at the expense of increasing the input capacitance of
the gate. This transistor sizing does not solve the problem that the
delay increases with increasing numbers of inputs. It merely shifts
the burden from the gate itself to its driver.

To determine the logical effort and parasitic delay of the 3-input NAND gate, we size the transistors to match the output current of the reference inverter. Figure 4.3 shows the matched 3-input NAND gate with transistor widths. When all inputs are 1, all three nMOS transistors are switched on. The series of on-resistances sums to a total of \(3 R_n(nand).\) To be equal to the on-resistance \(R_n(inv)\) of the reference inverter, we need to triple the widths of the nMOS transistors of the NAND gate, \(W_n(nand) = 3 W_n(inv) = 3.\) If one of the NAND inputs is 0, then one pMOS transistor is switched on. To match the on-resistance of the reference inverter, we assign the same width \(W_p(nand) = W_p(inv) = 2\) to the pMOS transistors of the NAND gate. If more than one NAND input is 0, then the parallel composition of pMOS transistors has a lower on-resistance than for a single input. The corresponding pull-up delay is smaller than if only one input is 0. Therefore, using normalized pMOS widths of 2 units matches the reference inverter in the worst case, when the pull-up delay of the NAND gate is largest.

The matched 3-input NAND gate enables us to determine the logical effort of a 3-input NAND gate by means of the normalized input capacitances. Each input, e.g. input \(A,\) drives two parallel gate capacitances of one pMOS and one nMOS transistor:
\[C_{in}(A) = W_p(nand) + W_n(nand) = 2 + 3 = 5\,.\]

Therefore, the logical effort of the 3-input NAND gate is
\[g_{nand3} = \frac{C_{in}(A)}{C_{in}(inv)} = \frac{5}{3}\]

per input. Note that the logical effort of the 3-input NAND gate is \(1/3\) larger than the logical effort of the 2-input NAND gate. Furthermore, we determine the parasitic delay of the 3-input NAND gate by calculating the normalized output capacitance:
\[C_{out} = 3\,W_p(nand) + W_n(nand) = 3 \cdot 2 + 3 = 9\,.\]

Thus, the parasitic delay of a 3-input NAND gate is
\[p_{nand3} = \frac{C_{out}}{C_{out}(inv)} = \frac{9}{3} = 3\,.\]

We find that the parasitic delay of the 3-input NAND gate is one delay unit larger than that of the 2-input NAND gate.

For NAND gates with \(n\) inputs, the matched gate has nMOS transistors of width \(n\) and pMOS transistors of width 2. Therefore, the logical effort of the \(n\)-input NAND gate is
\[g_{nand} = \frac{n + 2}{3}\tag{1}\]

per input, and has parasitic delay
\[p_{nand} = n\,.\tag{2}\]

The larger the number of inputs \(n,\) the larger the logical effort and the parasitic delay of the NAND gate. As a consequence, the more inputs the gate has, the slower it is, or, from the circuit designer perspective, the more effort we need to invest in the driver circuit to keep the path delay low.
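The two formulas are easy to tabulate. The following sketch (a minimal illustration; the function names are mine) evaluates the logical effort and parasitic delay of matched \(n\)-input NAND gates with exact rational arithmetic:

```python
from fractions import Fraction

def nand_logical_effort(n):
    # Matched n-input NAND: nMOS width n, pMOS width 2, so each input
    # presents capacitance n + 2; the reference inverter presents 3.
    return Fraction(n + 2, 3)

def nand_parasitic_delay(n):
    # Output node: n pMOS drains of width 2 plus one nMOS drain of
    # width n give C_out = 3n, normalized by the inverter's 3.
    return Fraction(3 * n, 3)

for n in (2, 3, 16):
    print(n, nand_logical_effort(n), nand_parasitic_delay(n))
```

For \(n = 16\) this reproduces the values \(g_{nand16} = 6\) and \(p_{nand16} = 16\) used in the next section.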

The design of an \(n\)-input NOR gate proceeds analogously to the \(n\)-input NAND gate. The \(n\)-input NOR gate has \(n\) parallel nMOS transistors and \(n\) pMOS transistors in series. The matched \(n\)-input NOR gate has nMOS transistors of width \(W_n = 1\) and pMOS transistors of width \(W_p = 2n.\) As a result, each input of the \(n\)-input NOR gate has logical effort
\[g_{nor} = \frac{2 n + 1}{3}\]

and the parasitic delay of the gate is
\[p_{nor} = n\,.\]

For example, a 3-input NOR gate has logical effort \(g_{nor3} = 7/3\) per input and parasitic delay \(p_{nor3} = 3.\) The logical effort of an \(n\)-input NOR gate is larger than the logical effort of an \(n\)-input NAND gate. Thus, if we have the choice to implement the logic of a circuit path with NAND gates or NOR gates, we commonly prefer NAND gates, because they tend to yield lower path delays.
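The preference for NAND gates can be checked numerically. This sketch, under the same sizing assumptions as above, compares the per-input logical efforts \((n+2)/3\) for NAND against \((2n+1)/3\) for NOR:

```python
from fractions import Fraction

def nand_g(n):
    # Matched n-input NAND: nMOS width n, pMOS width 2.
    return Fraction(n + 2, 3)

def nor_g(n):
    # Matched n-input NOR: pMOS width 2n, nMOS width 1.
    return Fraction(2 * n + 1, 3)

for n in range(2, 6):
    assert nor_g(n) > nand_g(n)   # NOR is always the larger effort
    print(n, nand_g(n), nor_g(n))
```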

### 4.1.2. Tree-Structured Logic Gates

When the number of inputs of logic gates is large, tree-structured gates offer superior performance compared to CMOS gates. As a concrete example, consider a 16-input NAND gate. The logical effort of a CMOS NAND gate with a pull-down network of 16 nMOS transistors in series is \(g_{nand16} = 6\) according to Equation (1), and the parasitic delay is \(p_{nand16} = 16\) according to Equation (2). Figure 4.4 shows an alternative, tree-structured implementation that uses 2-input NAND gates and inverters as building blocks. The tree has a path logical effort of \(G = g_{nand2}^4 g_{inv}^3 = 3.16,\) and a path parasitic delay of \(P = 4 p_{nand2} + 3 p_{inv} = 11.\) The tree structure improves both quantities.
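The path quantities of the tree can be verified in a few lines (a sketch; the per-gate values \(g_{nand2} = 4/3,\) \(p_{nand2} = 2,\) \(g_{inv} = p_{inv} = 1\) are taken from the text):

```python
from fractions import Fraction

g_nand2, p_nand2 = Fraction(4, 3), 2
g_inv, p_inv = 1, 1

# 16-input NAND tree: four NAND2 levels interleaved with three inverters.
G = g_nand2**4 * g_inv**3      # path logical effort, (4/3)^4 = 256/81
P = 4 * p_nand2 + 3 * p_inv    # path parasitic delay
print(float(G), P)             # about 3.16 and 11
```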

The tree structure of the NAND circuit is inspired by Boolean algebra, specifically the associativity of the AND operation:
\[(A\,B)\,C = A\,(B\,C)\,.\]

Associativity implies that we can parenthesize an expression any way
we want to. We may even omit the parentheses altogether because,
mathematically, the parenthesization does not affect the value of the
expression. From the perspective of the circuit designer, however,
parentheses express an *evaluation order*. For example, given a
binary AND operation and 8 operands \(A_0, A_1, \ldots, A_7,\)
there are many different evaluation orders. One of them corresponds to
the left-skewed parenthesization

which groups the operations such that all 7 AND operations must be
executed one after the other. The corresponding circuit has a
**depth** of 7 operations. In contrast, the tree-structured
parenthesization yields the minimum depth of \(\log_2 8 = 3\)
operations:
\[Y = \bigl((A_0\,A_1)\,(A_2\,A_3)\bigr)\,\bigl((A_4\,A_5)\,(A_6\,A_7)\bigr)\,.\]

Figure 4.5 shows the corresponding circuits. We note that the tree structure is asymptotically optimal, i.e. for a large enough number of inputs \(n,\) no other circuit topology has a depth smaller than \(\log_2 n,\) up to constant factors. The constants depend on the number of inputs of the building blocks. For example, if we replace subtrees of three 2-input AND gates with one 4-input AND gate, the depth of the tree-structured AND gate in Figure 4.5 decreases from 3 to 2 levels.
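The depth comparison generalizes to any number of operands. The following sketch (helper names are mine) contrasts the skewed and balanced evaluation orders, including the effect of wider building blocks:

```python
import math

def skewed_depth(n):
    # Left-skewed parenthesization evaluates n-1 AND operations in sequence.
    return n - 1

def balanced_depth(n, arity=2):
    # A balanced tree of `arity`-input gates has logarithmic depth.
    return math.ceil(math.log2(n) / math.log2(arity))

print(skewed_depth(8), balanced_depth(8))   # 7 versus 3
print(balanced_depth(8, arity=4))           # 2 levels with 4-input gates
```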

Given a tree-structured \(n\)-input AND gate, we obtain an
\(n\)-input NAND gate by complementing the output. The 16-input
NAND gate in Figure 4.4 implements each 2-input AND gate of
the tree with a 2-input CMOS NAND circuit followed by an inverter. To
complement the output, we omit the inverter at the root of the tree.
The 3-input NAND gate in Figure 4.1 is an example of an
*unbalanced* NAND tree, where the number of inputs is not a power of
2. In contrast, the trees in Figure 4.4 and Figure 4.5
are *balanced*, because each path from input to output has the same
depth, or number of gates.

### 4.1.3. Asymmetric Gates

An **asymmetric gate** has inputs with different logical efforts.
Asymmetric gates require additional attention from the circuit
designer, but facilitate faster circuits if used appropriately. The
asymmetry may be caused by the circuit topology or by deliberate
transistor sizing to reduce the delay of the critical path. We
design a majority gate to motivate asymmetric circuit topologies,
and then discuss how to trade logical effort between inputs.

#### Majority Gate

The **majority operation** is a ternary Boolean operation,
cf. *majority function*:
\[M = A\,B + A\,C + B\,C\,.\]

In terms of a truth table the majority operation is specified as:

| \(A\) | \(B\) | \(C\) | \(M\) |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 |
| 0 | 1 | 0 | 0 |
| 0 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 1 |
| 1 | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 |

As the name suggests, the output of the majority operation is the value that occurs more often in inputs \(A,\) \(B,\) and \(C.\) The inverting 3-input majority gate in Figure 4.6 computes the complement \(Y = \overline{M}.\)

Figure 4.6: Symmetric CMOS circuit of inverting 3-input majority gate and interactive switch model.
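The majority operation is easy to express in code. This sketch enumerates the truth table and the output \(Y = \overline{M}\) of the inverting gate:

```python
from itertools import product

def majority(a, b, c):
    # Output the value that occurs at least twice among the three inputs.
    return 1 if a + b + c >= 2 else 0

for a, b, c in product((0, 1), repeat=3):
    y = 1 - majority(a, b, c)   # inverting majority gate output
    print(a, b, c, majority(a, b, c), y)
```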

We characterize the delay of the majority gate by deriving its logical
effort and parasitic delay. The majority gate has six **arms**, each
of which is a series composition of two transistors. To match the
on-resistances of the reference inverter, we make the transistors in
each arm of the majority gate twice as wide as those of the reference
inverter. Thus, as shown in Figure 4.7, all pMOS
transistors have width \(W_p = 4\) and all nMOS transistors have
width \(W_n = 2.\) These transistor widths match the reference
inverter for all combinations of input values, except when all input
values are equal. In this case all three arms of either pull-up or
pull-down network are switched on, such that the parallel composition
yields a three times smaller equivalent on-resistance. We classify
these input combinations as exceptionally fast, analogous to the
exceptional input combinations of the NAND or NOR gates.

Each input of the majority gate drives two pMOS and two nMOS transistors. Therefore, all three inputs \(A,\) \(B,\) and \(C\) have logical effort
\[g_M = \frac{2 \cdot 4 + 2 \cdot 2}{3} = 4\,.\]

The normalized output capacitance of the majority gate consists of three pMOS and three nMOS transistors. Therefore, the parasitic delay of the majority gate is
\[p_M = \frac{3 \cdot 4 + 3 \cdot 2}{3} = 6\,.\]

Compared to a NAND or NOR gate, the majority gate is relatively slow.
One option for reducing the logical effort and parasitic delay of the
majority gate is **transistor sharing**. Observe in Figure 4.7 that the two pMOS transistors that one input drives in the
pull-up network are in distinct arms. The pull-down network exhibits
the same structure, only with nMOS transistors. If two arms share one
transistor, we can reduce the logical effort by reducing the number of
transistors from two to one. This does not work for all inputs
simultaneously. As shown in Figure 4.8, if we share the transistors of input \(A,\) we
cannot also share the transistors of inputs \(B\) and \(C\)
across arms without altering the logic function. However, sharing the
transistors of input \(A\) reduces its logical effort, and also
the parasitic delay of the entire gate.

Figure 4.8: Asymmetric inverting 3-input majority gate with transistor sharing and interactive switch model.

Sharing the transistors of input \(A,\) as shown in Figure 4.8, does not affect the transistor widths of the matched majority gate. A path from supply rail to output consists of a series composition of two transistors, either in the pull-up or pull-down network, just like in Figure 4.7. The exceptional case, when all three input values are equal, has a larger equivalent on-resistance than the gate in Figure 4.7, but is still faster than the regular case. Thus, the matched version of the majority gate in Figure 4.8 has transistor widths \(W_p = 4\) and \(W_n = 2.\) While the logical efforts of inputs \(B\) and \(C\) remain unchanged, transistor sharing halves the logical effort of input \(A\):
\[g_{Masym}(A) = \frac{4 + 2}{3} = 2\,.\]

Since the logical effort of input \(A\) differs from the logical
effort of inputs \(B\) and \(C,\) the majority gate in
Figure 4.8 is an **asymmetric gate**.
In contrast, the majority gate in Figure 4.6 is a
**symmetric gate**, because all inputs have equal logical effort.
Rather than characterizing an asymmetric gate by listing the logical
efforts for all inputs, it is sometimes sufficient to use a single
number instead. The **total logical effort** is the sum of the
logical efforts of all inputs. For example, the total logical effort
permits a quantitative comparison of the symmetric and asymmetric
majority gates. The symmetric majority gate has total logical effort
\(g_{M} = g_{M}(A) + g_{M}(B) + g_{M}(C) = 3 \cdot 4 = 12.\)
Transistor sharing reduces the total logical effort of the asymmetric
gate to \(g_{Masym} = 2 + 2 \cdot 4 = 10.\)

Sharing the transistors of input \(A\) reduces not only the logical effort but also the output capacitance and, thus, the parasitic delay of the asymmetric majority gate:
\[p_{Masym} = \frac{2 \cdot 4 + 2 \cdot 2}{3} = 4\,.\]

Note that we can construct alternative gate designs by sharing the transistors of input \(B\) or of input \(C.\) The resulting asymmetric gates do not reduce the output capacitance, however. Only sharing the transistors driven by input \(A\) reduces the parasitic delay. Thus, if we have the choice, we prefer sharing those transistors connected to the output of the gate in order to reduce the parasitic delay as added benefit.
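The comparison between the symmetric and asymmetric majority gates can be summarized numerically. This sketch uses the normalized widths \(W_p = 4\) and \(W_n = 2\) from above; the helper name `effort` is mine:

```python
from fractions import Fraction

C_INV = 3  # normalized input (and output) capacitance of the reference inverter

def effort(n_pmos, n_nmos, wp=4, wn=2):
    # Normalized capacitance of n_pmos pMOS and n_nmos nMOS transistors.
    return Fraction(n_pmos * wp + n_nmos * wn, C_INV)

g_sym = effort(2, 2)                 # every input of the symmetric gate: 4
g_asym_a = effort(1, 1)              # shared input A: 2
total_sym = 3 * g_sym                # 12
total_asym = g_asym_a + 2 * g_sym    # 10
p_sym = effort(3, 3)                 # three pMOS and three nMOS at output: 6
p_asym = effort(2, 2)                # sharing removes one of each: 4
```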

#### Asymmetric Transistor Sizing

The asymmetric majority gate sacrifices the symmetry of the circuit topology for a reduced logical effort of one input without affecting the logical effort of the other inputs. The benefit of the reduced logical effort is a reduced gate delay for signal transitions on the corresponding input. Even with a symmetric gate topology, inherent to NAND and NOR gates for instance, we can trade logical effort between inputs by transistor sizing. The resulting asymmetry may be desirable to speed up the critical path of a circuit.

As an example, consider the 2-input NAND gate in Figure 4.9. Assume that input \(A\) is critical, and we wish to reduce its delay. Thus, our goal is to minimize the logical effort of input \(A,\) which means to reduce \(g_{nand2}(A)\) from 4/3 to that of a reference inverter \(g_{inv} = 1.\)

Since the matched NAND gate has normalized nMOS transistor widths \(W_n(A) = W_n(B) = 2,\) reducing width \(W_n(A)\) to 1 reduces the input capacitance of input \(A\) as desired. However, if all we do is reduce the input capacitance of input \(A,\) the gate becomes mismatched w.r.t. the reference inverter. In fact, halving the width of the nMOS transistor doubles its on-resistance, which results in a smaller output current. Consequently, the gate would be slower rather than faster as planned. To obtain a delay reduction, we need to reduce the logical effort, i.e. we wish to reduce the input capacitance of input \(A\) without changing the equivalent on-resistance of the pull-down network. Transistor sharing in the majority gate has exactly this effect.

In case of the 2-input NAND gate, we can balance a change of width \(W_n(A)\) by a corresponding change of \(W_n(B)\) so that the pull-down network of the asymmetric NAND gate remains matched to the reference inverter. More specifically, matching an asymmetric 2-input NAND gate requires that equivalent on-resistance \(R_n(A) + R_n(B)\) of the series composition of nMOS transistors must be equal to on-resistance \(R_n(inv)\) of the reference inverter:
\[R_n(A) + R_n(B) = R_n(inv)\,.\]

Since the on-resistance is inversely proportional to the transistor
width, we obtain this *matching constraint* for the asymmetric NAND gate:
\[\frac{1}{W_n(A)} + \frac{1}{W_n(B)} = \frac{1}{W_n(inv)} = 1\,,\]

because the reference inverter has an nMOS transistor of width \(W_n(inv) = 1.\) We can fulfill this constraint with a scaling fraction \(\alpha,\) \(0 < \alpha < 1,\) if we choose \(W_n(A) = 1/(1-\alpha)\) and \(W_n(B) = 1/\alpha,\) as shown in Figure 4.9. For \(\alpha = 1/2,\) we obtain the symmetric matched NAND gate with \(W_n(A) = W_n(B) = 2.\) Scaling fraction \(\alpha < 1/2\) reduces the logical effort of input \(A.\) As \(\alpha\) approaches 0, width \(W_n(A)\) and logical effort \(g_{nand2asym}(A)\) approach value 1. The table below lists the relevant quantities for several values of \(\alpha.\)

| \(\alpha\) | \(W_n(A)\) | \(W_n(B)\) | \(g_{nand2asym}(A)\) | \(g_{nand2asym}(B)\) | \(g_{nand2asym}\) |
|---|---|---|---|---|---|
| 1/2 | 2 | 2 | 4/3 | 4/3 | 2.67 |
| 1/4 | 4/3 | 4 | 10/9 | 2 | 3.11 |
| 1/10 | 10/9 | 10 | 28/27 | 4 | 5.04 |
| 1/100 | 100/99 | 100 | 298/297 | 34 | 35.00 |

The price of reducing the logical effort of input \(A\) is a rapidly growing logical effort of input \(B.\) Choosing \(\alpha = 1/4\) appears to be a good compromise, where \(g_{nand2asym}(A) = 1.1\) is within \(10\,\%\) of the minimum logical effort, yet the total logical effort of the gate increases by \(16\,\%\) only.
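The table entries follow mechanically from the matching constraint. This sketch (function name mine) recomputes them with exact rational arithmetic:

```python
from fractions import Fraction

def asym_nand2(alpha):
    # Matching constraint: 1/Wn(A) + 1/Wn(B) = 1/Wn(inv) = 1.
    wn_a = 1 / (1 - alpha)
    wn_b = 1 / alpha
    g_a = (2 + wn_a) / 3   # pMOS widths stay at 2
    g_b = (2 + wn_b) / 3
    return wn_a, wn_b, g_a, g_b, g_a + g_b

for alpha in (Fraction(1, 2), Fraction(1, 4), Fraction(1, 10)):
    print(alpha, asym_nand2(alpha))
```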

### 4.1.4. Compound Gates

There exist complex Boolean expressions that we can implement in a single CMOS gate. In particular, all Boolean expressions formed with AND and OR operations can be cast into a CMOS gate if all variables appear in uncomplemented form only and the entire AND-OR expression is complemented.

Two widely used families of compound gates are the **inverting AND-OR
(AOI)** gates and the **inverting OR-AND (OAI)** gates. As an
example, consider the Boolean expression
\[Y = \overline{A\,B + C}\,.\]

If our supply of CMOS gates consists of NAND, NOR, and NOT gates, we might
be tempted to use the *involution theorem*, and
transform the expression such that we can implement \(Y\) with
these gates:
\[Y = \overline{A\,B + C} = \overline{\overline{\overline{A\,B}} + C}\,.\]

The circuit for \(Y\) is shown in Figure 4.10. The logical effort of the longest paths \(A \rightarrow Y\) and \(B \rightarrow Y\) is \(G = g_{nand} g_{inv} g_{nor} = 20/9\) and the corresponding path parasitic delay is \(P = p_{nand} + p_{inv} + p_{nor} = 5.\)

Now, recall the *completeness theorem*, which tells us that any Boolean
expression of AND and OR operations can be realized with a
series-parallel switch network. In our example, \(Y=0\) if and
only if \(A=1\) AND \(B=1,\) OR if \(C=1.\) This logical
equivalence translates directly into the CMOS pull-down network in
Figure 4.11(a). We implement the conjunction of \(A\)
and \(B\) with a series composition of nMOS transistors, and the
disjunction of \(AB\) and \(C\) with a parallel composition.
With the pull-down network in place, we derive the pull-up network in
Figure 4.11(b) systematically by applying the
*principle of duality*. The pull-up network
uses pMOS instead of nMOS transistors, and we replace the parallel
composition with a series composition and vice versa.

Figure 4.11(c) redraws the pull-up network
slightly for the sake of clarity, and annotates the transistor sizes
needed to match the reference inverter. The gate is *asymmetric*, because the logical efforts of the inputs differ:

The parasitic delay of the gate is \(p = 7/3.\) Compared to the
logically equivalent circuit in Figure 4.10, the CMOS gate
offers improvements in terms of speed and number of transistors. To
emphasize the fact that our CMOS gate is a compound gate, we introduce
the symbol in Figure 4.12 by fusing an AND gate with a NOR
gate. Since the output of the AND gate drives the input of the OR
gate whose output is complemented by an inverter, the compound gate is
an *inverting AND-OR gate* or *AOI* gate for short.

AOI gates exist for the complement of all Boolean expressions in
*SOP normal form*.

Analogous to AOI gates, the family of OAI gates, or *inverting OR-AND
gates*, consists of compound gates for Boolean expressions in
complemented *POS normal form*. As an example,
consider Boolean expression
\[Y = \overline{(A + B)\,(C + D)}\,.\]

The OAI gate for \(Y\) and the corresponding OAI symbol are shown in Figure 4.13. The matched gate is symmetric, with logical effort \(g=2\) per input and parasitic delay \(p=4.\)

Besides the AOI and OAI gates for complemented two-level normal forms, we can design compound gates for multilevel expressions as well. For example, Boolean expression
\[Y = \overline{(A + B\,C)\,D}\]

has a pull-down network consisting of a series composition of \(D\) and a parallel composition of \(A\) and a series composition of \(B\) and \(C.\)

The compound gate in Figure 4.14 is *asymmetric*, because the logical efforts of the inputs differ:
\[g(A) = 2\,,\quad g(B) = g(C) = \frac{8}{3}\,,\quad g(D) = \frac{4}{3}\,.\]

The parasitic delay of the compound gate is \(p = 8/3.\)

CMOS designers often use compound gates as efficient implementation
target. If one of the inputs must be complemented, then it is easy to
introduce a separate inverter stage to drive the input of the compound
gate. If we want an *AO* or *OA* function without inverted output, we
may use an additional inverter stage to complement the output of the
corresponding *AOI* or *OAI* gate.

Design a CMOS compound gate to compute

\(\qquad{\displaystyle Y = \overline{(A + B + C) \cdot D}\,.}\)

- Derive either the pull-down or the pull-up network.
- Deduce the complementary network using the principle of duality.
- Perform transistor sizing to match the compound gate.
- Determine the logical effort(s) and parasitic delay.

We derive the pull-down network by noticing in the Boolean expression that \(Y = 0\) if \((A+B+C)\cdot D = 1.\) Therefore, we design a series-parallel pull-down network that connects output \(Y\) to *GND* when the network is switched on. Boolean subexpression \(A+B+C\) translates into a parallel composition of nMOS transistors, and the conjunction with \(D\) into a series composition. The Figure below shows the pull-down network in part (a). We derive the pull-up network from the pull-down network using the principle of duality. For each nMOS transistor we introduce a corresponding pMOS transistor. The parallel composition of the nMOS transistors controlled by inputs \(A,\) \(B,\) and \(C\) becomes a series composition of pMOS transistors. Furthermore, the series composition of nMOS transistor \(D\) with the parallel composition turns into a parallel composition of pMOS transistor \(D\) with the series composition of pMOS transistors. The pull-up network is shown as part of the complete CMOS compound gate in part (b) of the Figure.

The transistor sizes for the matched compound gate are chosen to have the same on-resistances as the reference inverter, see part (c) of the Figure above. We start with the pull-down network. The reference inverter has an nMOS transistor of width \(W_n(inv) = 1.\) Our compound gate has three paths to switch on the pull-down network, through nMOS transistor \(D\) and one of nMOS transistors \(A,\) \(B,\) or \(C.\) The largest on-resistance appears if exactly one of the paths is switched on. This path consists of a series composition of two nMOS transistors. Since the on-resistances of two resistors in series add, we want each on-resistance to be half that of the reference inverter. We achieve this goal by making the nMOS transistors twice as wide, i.e. we choose widths \(W_n = 2 \cdot W_n(inv) = 2\) for all nMOS transistors. If more than one path through the pull-down network is switched on, the on-resistance will be smaller than that of the reference inverter, which we consider an exceptional case that is faster than the worst case.

The pull-up network must have the same on-resistance as the pMOS transistor of the reference inverter of width \(W_p(inv) = 2.\) The pull-up network has two paths. In the worst case, only one of these paths is switched on. To match the path through pMOS transistor \(D,\) we assign width \(W_p(D) = W_p(inv) = 2.\) The other path consists of a series composition of three pMOS transistors. If each of these transistors has one third of the on-resistance of the pMOS transistor of the reference inverter, the equivalent on-resistance of the path equals that of the reference inverter. Thus, we triple the widths \(W_p(A) = W_p(B) = W_p(C) = 3 \cdot W_p(inv) = 6.\) In the exceptional case of input combination \(ABCD=0000,\) both pull-up paths are switched on, the equivalent on-resistance of the parallel paths is smaller than that of the reference inverter, and the compound gate will be faster.

The logical effort of a compound gate input is its input capacitance divided by the input capacitance of the reference inverter \(C_{in}(inv) = 3.\) Consider input \(A\) of our compound gate. Input capacitance \(C_{in}(A)\) is the sum of the normalized transistor widths input \(A\) controls, here \(W_p(A) = 6\) and \(W_n(A) = 2.\) Thus, we find \(C_{in}(A) = 6 + 2 = 8.\) Inputs \(B\) and \(C\) have the same input capacitance, whereas input \(D\) has input capacitance \(C_{in}(D) = 2 + 2 = 4.\) The compound gate is asymmetric with logical efforts:

\[g(A) = g(B) = g(C) = \frac{8}{3}\,,\quad g(D) = \frac{4}{3}\,.\]

The parasitic delay of the compound gate is the ratio of its output capacitance and the output capacitance of the reference inverter \(C_{out}(inv) = 3.\) The output capacitance of the compound gate is the sum of the normalized widths of all transistors connected to output \(Y,\) here \(C_{out} = W_p(C) + W_p(D) + W_n(D) = 6 + 2 + 2 = 10.\) Therefore, the parasitic delay of our compound gate is

\[p = \frac{10}{3}\,.\]
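The bookkeeping of this worked example can be scripted. This sketch uses the widths derived in the sizing step above; the variable names are mine:

```python
from fractions import Fraction

C_INV = 3   # input and output capacitance of the reference inverter

# Matched widths for Y = NOT((A + B + C) * D) from the sizing above.
Wp = {"A": 6, "B": 6, "C": 6, "D": 2}
Wn = {"A": 2, "B": 2, "C": 2, "D": 2}

# Logical effort of each input: its total gate capacitance over C_INV.
g = {x: Fraction(Wp[x] + Wn[x], C_INV) for x in "ABCD"}

# Transistors whose drains connect to output Y: pMOS C, pMOS D, nMOS D.
p = Fraction(Wp["C"] + Wp["D"] + Wn["D"], C_INV)
print(g, p)
```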

### 4.1.5. XOR and XNOR Gates

Our CMOS implementation of the XOR gate in Figure 1.42 deserves a closer look. Unlike NAND or NOR gates, the XOR gate assumes that the inputs are available in complemented and uncomplemented form, and the pull-up and pull-down networks are not duals of each other. First, however, we characterize the delay of the 2-input XOR gate by deriving its logical effort and parasitic delay. The matched 2-input XOR gate is shown in Figure 4.15. Each arm is a series composition of two transistors. To match the on-resistances of the reference inverter, we make the transistors in each arm of the XOR gate twice as wide.

Each of the four inputs of the 2-input XOR gate, \(A,\) \(\overline{A},\) \(B,\) and \(\overline{B}\) drives one pMOS and one nMOS transistor. Therefore, each input has logical effort
\[g_{xor2} = \frac{4 + 2}{3} = 2\,.\]

The normalized output capacitance of the XOR gate consists of two pMOS and two nMOS transistors, so that the parasitic delay is
\[p_{xor2} = \frac{2 \cdot 4 + 2 \cdot 2}{3} = 4\,.\]

Compared to the 2-input NAND and NOR gates, the XOR gate uses twice as many transistors, whose capacitances make it relatively slow.

#### Symmetric XOR Gates

XOR gates with \(n\)-inputs implement the **odd-parity**
operation, which outputs a 1 if the number of inputs with value 1 is
odd, cf. *parity function*. For example, for
\(n = 3,\) function \(Y = A \oplus B \oplus C\) equals 1 if
the number of 1-inputs is 1 or 3. The truth table shows the 3-input
XOR function or odd-parity function.

| \(A\) | \(B\) | \(C\) | \(Y\) |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 1 |
| 0 | 1 | 0 | 1 |
| 0 | 1 | 1 | 0 |
| 1 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 |
| 1 | 1 | 1 | 1 |

The odd-parity property suggests a design for an XOR CMOS circuit. To build an \(n\)-input XOR gate, we need \(2^n\) arms with \(n\) transistors each. Half of the arms are in the pull-up network, one per minterm, and each has an odd number of complemented inputs. The other half of the arms are in the pull-down network. For each combination of input values, exactly one arm is switched on. Figure 4.16 shows the CMOS circuit for a 3-input XOR gate. Whenever you toggle one of the inputs, output \(Y\) toggles, because changing one input changes the parity from odd to even or vice versa.

Figure 4.16: CMOS circuit of 3-input XOR gate and interactive switch model.

The matched 3-input XOR gate has pMOS transistors of width \(W_p = 6\) and nMOS transistors of width \(W_n = 3.\) Each of the six inputs \(A,\) \(\overline{A},\) \(B,\) \(\overline{B},\) \(C,\) and \(\overline{C}\) drives two pMOS and two nMOS transistors, resulting in a logical effort of \(g_{xor3} = 6\) per input. The output is connected to four pMOS and four nMOS transistors, which contribute to the output capacitance and parasitic delay \(p_{xor3} = 12.\) Since the gate topology generalizes to \(n\) inputs, we find that the \(n\)-input XOR gate has logical effort
\[g_{xor} = n\,2^{n-2}\]

per input. The parasitic delay of the \(n\)-input XOR gate amounts to
\[p_{xor} = n\,2^{n-1}\,.\]

The exponential growth of both logical effort and parasitic delay in
the number of inputs \(n\) limits the applicability of this CMOS
circuit to small \(n.\) For larger numbers of inputs
*tree-structured* topologies based on 2-input
XOR gates as building blocks are faster.
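The exponential growth becomes obvious when the formulas are evaluated. This sketch assumes the matched gate derived above, with \(W_p = 2n,\) \(W_n = n,\) each literal driving \(2^{n-2}\) pMOS and \(2^{n-2}\) nMOS transistors, and \(2^{n-1}\) drains of each type at the output:

```python
from fractions import Fraction

def xor_effort(n):
    # Each literal drives 2^(n-2) pMOS (width 2n) and 2^(n-2) nMOS (width n).
    g = Fraction(2**(n - 2) * 3 * n, 3)
    # 2^(n-1) pMOS and 2^(n-1) nMOS drains connect to the output node.
    p = Fraction(2**(n - 1) * 3 * n, 3)
    return g, p

for n in (2, 3, 4, 5):
    print(n, *xor_effort(n))   # grows as n*2^(n-2) and n*2^(n-1)
```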

#### Asymmetric XOR Gates

We can reduce the logical effort and parasitic delay of the XOR gate
by *transistor sharing*. Notice in
Figure 4.16 that each input drives two pMOS and
two nMOS transistors. If two arms share one transistor, we can reduce
the logical effort of the corresponding input. Sharing does not work
for all inputs simultaneously. As shown in Figure 4.17 we share the transistors of inputs \(A\) and
\(\overline{A}\) to reduce their logical efforts and the parasitic
delay of the entire gate. Then, we can also share the transistors of
inputs \(C\) and \(\overline{C},\) but not those of inputs
\(B\) and \(\overline{B}\) without altering the logic
function.

Figure 4.17: Asymmetric CMOS circuit of 3-input XOR gate and interactive switch model.

To match the XOR gate, observe that any path from supply to output consists of a series composition of 3 transistors, just like an arm in Figure 4.16. Therefore, sharing transistors does not affect the transistor widths of the matched gate. Like the symmetric XOR gate, all pMOS transistors of the matched gate have width \(W_p = 6\) and all nMOS transistors width \(W_n = 3.\) As a result, the XOR gate in Figure 4.17 is asymmetric. The logical efforts of inputs \(A,\) \(\overline{A},\) \(C\) and \(\overline{C}\) are
\[g_{xor3asym}(A) = g_{xor3asym}(\overline{A}) = g_{xor3asym}(C) = g_{xor3asym}(\overline{C}) = \frac{6 + 3}{3} = 3\,,\]

while the logical efforts of inputs \(B\) and \(\overline{B}\) remain unchanged compared to the symmetric design:
\[g_{xor3asym}(B) = g_{xor3asym}(\overline{B}) = \frac{2 \cdot 6 + 2 \cdot 3}{3} = 6\,.\]

Sharing the transistors connected to the output of the gate halves the output capacitance, and therefore, the parasitic delay of the asymmetric XOR gate is
\[p_{xor3asym} = \frac{2 \cdot 6 + 2 \cdot 3}{3} = 6\,.\]

The asymmetric XOR gate has smaller total logical effort and smaller parasitic delay than the symmetric gate.

#### XOR Gates and Duality

The symmetric 2-input XOR gate in Figure 4.15 violates the
principle of duality, because the pull-up and pull-down networks are
not duals of each other. In this section, we use the 2-input XOR
gate to demonstrate that **duality is a sufficient but not a necessary
property of CMOS gates**.

Recall the truth table of the 2-input XOR gate, \(Y = A \oplus B\):

| \(A\) | \(B\) | \(Y\) |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

The SOP normal form for the XOR function is
\[Y = \overline{A}\,B + A\,\overline{B}\,.\]

Interpret this equality as \(Y=1\) if \(A=0\) AND \(B=1\) OR if \(A=1\) AND \(B=0.\) Negate predicates \(B=1\) and \(A=1\) to \(\overline{B}=0\) and \(\overline{A}=0,\) to simplify the argument for the switch model of pMOS transistors. Then, we obtain the equivalent XOR logic:

\(Y=1\) if (\(A=0\) AND \(\overline{B}=0\)) OR if (\(\overline{A}=0\) AND \(B=0\)).

The Boolean expression consists of AND and OR operations, which translates directly into the series-parallel pull-up network in Figure 4.18(a). Given the pull-up network, we can derive the pull-down network by forming the dual, as shown in Figure 4.18(b). We transform the parallel arms of the pull-up network into a series composition of parallel nMOS transistors. The interactive switch model enables you to verify the truth table of the XOR gate.
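The SOP interpretation can be verified exhaustively. This sketch checks the expression against Python's built-in XOR operator:

```python
from itertools import product

def xor_sop(a, b):
    # Y = (NOT A AND B) OR (A AND NOT B), on 0/1-valued inputs.
    return ((1 - a) & b) | (a & (1 - b))

for a, b in product((0, 1), repeat=2):
    assert xor_sop(a, b) == a ^ b
print("SOP form matches XOR on all four input combinations")
```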

Figure 4.18: 2-input XOR gate: (a) pull-up network derived from logic function, (b) the pull-down network is the dual of the pull-up network, and (c) interactive switch model.

The XOR gate in Figure 4.18 has the same pull-up network as our original XOR gate in Figure 4.15, up to the order of the pMOS gate inputs. However, the pull-down networks are different. To arrive at the XOR gate in Figure 4.15, consider the truth table of the XOR gate again, now with the goal to derive the pull-down network from the logic function. We see that \(Y=0\) if \(A=0\) AND \(B=0\) OR if \(A=1\) AND \(B=1.\) To fit the switch model of nMOS transistors, negate the predicates \(A=0\) and \(B=0\) to \(\overline{A}=1\) and \(\overline{B}=1,\) and we find the XOR logic:

\(Y=0\) if (\(\overline{A}=1\) AND \(\overline{B}=1\)) OR if (\(A=1\) AND \(B=1\)).

This Boolean expression translates directly into the series-parallel pull-down network in Figure 4.19(a). Deriving the dual of the pull-down network yields the pull-up network in Figure 4.19(b).

Figure 4.19: 2-input XOR gate: (a) pull-down network derived from logic function, (b) the pull-up network is the dual of the pull-down network, and (c) interactive switch model.

The XOR gate in Figure 4.19 has the same pull-down network as our original XOR gate in Figure 4.15, but a different pull-up network. However, we arrive at the XOR gate design in Figure 4.15 by combining the pull-up network of Figure 4.18 with the pull-down network of Figure 4.19. This design saves a wire and is simpler than the designs with dual networks, because the four arms are independent. Furthermore, the XOR gate demonstrates that CMOS gates do not necessarily have pull-up and pull-down networks that are duals of each other. The 3-input XOR gates and the majority gate are examples of CMOS gates whose pull-up and pull-down networks are not duals either.
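The duality argument can be checked exhaustively with a small switch-level sketch. The helper names are hypothetical; the model only encodes that a pMOS switch conducts when its gate input is 0 and an nMOS switch conducts when its gate input is 1:

```python
from itertools import product

def pull_up(a, b):
    # pMOS conducts on a 0 at its gate. The two parallel arms of
    # Figure 4.18(a): (A=0 AND ~B=0) OR (~A=0 AND B=0).
    na, nb = 1 - a, 1 - b
    return (a == 0 and nb == 0) or (na == 0 and b == 0)

def pull_down(a, b):
    # nMOS conducts on a 1 at its gate. The two parallel arms of
    # Figure 4.19(a): (~A=1 AND ~B=1) OR (A=1 AND B=1).
    na, nb = 1 - a, 1 - b
    return (na == 1 and nb == 1) or (a == 1 and b == 1)

for a, b in product((0, 1), repeat=2):
    up, down = pull_up(a, b), pull_down(a, b)
    assert up != down      # exactly one network conducts: no float, no short
    assert (1 if up else 0) == a ^ b   # matches the XOR truth table
```

The final assertions confirm that the non-dual pull-up and pull-down networks are nevertheless complementary, which is all a CMOS gate requires.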

#### XOR CMOS Circuits¶

The 2-input XOR gate in Figure 4.15 requires complemented and
uncomplemented inputs. In fact, we may specify this CMOS gate as a
logic function of four inputs, XOR\((A, B, C, D),\) with the
additional constraints \(C = \overline{A}\) and \(D =
\overline{B},\) that the driving circuit has to obey. The input
specification distinguishes the XOR gate from other CMOS gates, the
NAND and NOR gates for example. The 2-input NAND and NOR gates have
two inputs not only because they realize binary logic functions but
also in terms of the number of input wires that their CMOS
implementations have. In contrast, we define the XOR gate as a binary
logic function, draw the *XOR gate symbol* with two
inputs, but our CMOS gate in Figure 4.15 requires four inputs. The
reason is that there exists no 2-input CMOS gate for the 2-input XOR
logic function, because CMOS gates can implement monotonically
decreasing functions only. Hence, implementing a 2-input XOR logic
gate requires a CMOS circuit rather than a CMOS gate. In this
section, we discuss several CMOS implementations of the 2-input XOR
logic gate.

We begin with a straightforward extension of the CMOS gate in Figure 4.15. To obtain a 2-input XOR gate, we generate the complements of inputs \(A\) and \(B\) by means of inverters. The resulting implementation of the XOR logic gate is shown in Figure 4.20.

If we wish to use the 2-input XOR gate in Figure 4.20 as a
building block for the construction of larger circuits, we need to
characterize its delay as a function of the electrical effort. Since
the XOR logic gate consists of inverters and an XOR CMOS gate, we can
scale the inverters and the XOR CMOS gate independently to minimize
its delay. Since the XOR circuit is a reconverging branch, we could
apply the *2d-analysis method* to determine the
scale factors for a given electrical effort. However, a close look
reveals why a 2d-analysis may not yield the fastest design. First,
notice that the circuit is symmetric in inputs \(A\) and
\(B,\) if we use a symmetric XOR CMOS gate. Thus, the circuit has
one path of interest, say from input \(A\) to output \(Y.\)
Assume the inverter has input capacitance \(C_1.\) The XOR CMOS
gate imposes a load of one pMOS and one nMOS gate capacitance on each
leg of a fork driven by input \(A.\) Let \(C_2\) be the input
capacitance of inputs \(A\) and \(\overline{A}\) of the XOR
CMOS gate, then our XOR circuit includes a 1-fork with equal loads, as
shown in Figure 4.21.

Recall that we should avoid 1-forks, but let us ignore this lesson for
a moment. Instead, we insist on minimizing the delay of the path from
input to output in Figure 4.21 given electrical effort \(H.\)
Notice that the circuit does not permit equalizing the delays of the
legs of the fork. Instead, the path through the inverter leg adds the
inverter delay to the delay through the XOR CMOS gate. Thus, all we
can do by gate sizing is to minimize the delay of the slower leg with
the inverter. This leaves us with the optimization problem of how
much input capacitance to assign to the on-path inverter versus the
off-path XOR CMOS gate. As discussed in *fork design*, we introduce factor \(\alpha\) in range \(0 <
\alpha < 1\) to split the input capacitance between the inverter and
the XOR CMOS gate:

\[C_1 = \alpha\,C_{in}, \qquad C_2 = (1 - \alpha)\,C_{in},\]

such that \(C_1 + C_2 = C_{in}.\) Then, the stage efforts of the inverter and the XOR CMOS gate are:

\[f_{inv} = \frac{C_2}{C_1} = \frac{1-\alpha}{\alpha}, \qquad f_{xor} = g_{xor2}\,\frac{C_L}{C_2} = \frac{4\,H}{1-\alpha}.\]

Now, the lower leg of the fork has delay

\[D_l = f_{xor} + p_{xor2} = \frac{4\,H}{1-\alpha} + 4,\]

and the upper leg has delay

\[D_u = \frac{1-\alpha}{\alpha} + 1 + \frac{4\,H}{1-\alpha} + 4,\]

where \(D_u = (f_{inv} + p_{inv}) + D_l.\) We can minimize \(D_u\) by setting the derivative \(d D_u/ d \alpha\) to zero, and find

\[\alpha = \frac{1}{1 + 2 \sqrt{H}}.\]

Thus, when using the XOR logic gate as part of a larger circuit, we can choose \(\alpha\) depending on electrical effort \(H\) to minimize its delay.
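The optimal split can also be found numerically. The sketch below is a hypothetical model of the 1-fork circuit, assuming \(C_1 = \alpha C_{in}\) for the inverter, \(C_2 = (1-\alpha) C_{in}\) for the XOR CMOS gate, \(g_{inv} = p_{inv} = 1,\) and \(g_{xor2} = p_{xor2} = 4\):

```python
# Hypothetical delay model of the upper leg (through the inverter) of the
# 1-fork XOR circuit, as a function of capacitance split alpha.
def upper_leg_delay(alpha, H):
    f_inv = (1 - alpha) / alpha      # inverter stage effort C2/C1
    f_xor = 4 * H / (1 - alpha)      # XOR stage effort g_xor2 * C_L/C2
    return (f_inv + 1) + (f_xor + 4) # add parasitic delays p_inv, p_xor2

def best_alpha(H, steps=100000):
    # brute-force minimization over the open interval 0 < alpha < 1
    return min((k / steps for k in range(1, steps)),
               key=lambda a: upper_leg_delay(a, H))

H = 4.0
a = best_alpha(H)
print(round(a, 3))   # close to 1/(1 + 2*sqrt(H)) = 0.2 for H = 4
```

The numeric minimizer agrees with setting the derivative of the upper-leg delay with respect to \(\alpha\) to zero.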

If we wish to minimize the delay of the XOR logic gate further, we could make the XOR CMOS gate asymmetric such that the logical effort of the complemented input \(\overline{A}\) is smaller than that of the uncomplemented input, so as to speed up the path through the inverter leg. Alternatively, we may recall our lesson that a 2-fork is preferable to a 1-fork. Hence, our second attempt for an XOR logic gate uses 2-forks to generate the complemented and uncomplemented inputs, as shown in Figure 4.22.

We can analyze this XOR circuit with our *2d-analysis method*. The path of interest is shown in Figure 4.23, with effort delay \(d_1\) assigned to the inverters of
the 2-fork, and effort delay \(d_2\) for the XOR CMOS gate.

The 2d-analysis enables us to express effort delay \(d_2\) as a function of \(d_1\):

such that the path effort delay becomes

Minimizing the path effort delay by setting the derivative \(\partial D / \partial d_1\) to zero yields the polynomial in \(d_1\):

For a given electrical effort \(H,\) we determine delay \(d_1\) by finding the positive real root of the polynomial. Then, the minimum delay of the XOR logic gate is \(\hat{D} = d_1 + d_2 + P,\) where \(P = 2 p_{inv} + p_{xor2} = 6.\) Figure 4.28 below compares the delays of the two XOR gates with 1-fork and 2-fork inputs as a function of electrical effort \(H.\) Not unexpectedly, for all \(H\) the 2-fork design is faster than the 1-fork design. Furthermore, the 1-fork design slows down much more rapidly than the 2-fork design with increasing \(H.\)

The speed difference between the 1-fork and 2-fork XOR gate designs
motivates the search for even faster XOR gate circuits. In the
following, we present three alternative designs. We begin with the
reconverging NAND circuit in Figure 4.24 on the left.
Using Boolean algebra, it is straightforward to verify that this
circuit implements the 2-input XOR function. During our study of
*reconverging branches*, we have
identified the 2-fork style modification in Figure 4.24 on the right as superior because it is faster. The comparison
of the delays in Figure 4.28 shows that the
reconverging NAND circuit is slower than the 2-fork XOR gate circuit
up to \(H = 47.\) For larger electrical efforts, the reconverging
NAND circuit is faster.

Since NAND gates have not only a relatively small logical effort but also small parasitic delay compared to other 2-input gates, a 2-level NAND circuit is another top contender for an XOR gate. Again, we may use a 1-fork or the 2-fork shown in Figure 4.25 to generate the complemented and uncomplemented inputs. The 2-fork is faster than the 1-fork, as usual. The resulting XOR circuit has four stages on the 2-inverter leg, so that we expect this circuit to be better suited for larger electrical efforts.

To determine the minimum delay as a function of electrical effort
\(H,\) we apply the *1d-analysis method* with
effort delay \(d\) assigned to the gates as shown in
Figure 4.26. The 1d-analysis of the circuit
yields the polynomial

Given electrical effort \(H,\) the positive real root of the polynomial equals effort delay \(d,\) for which the circuit has minimum path delay \(\hat{D} = 4 d + 6.\) Figure 4.28 plots \(\hat{D}\) over electrical effort \(H.\) We see that the 2-fork XOR gate is faster up to \(H = 14.\) For \(H \ge 15,\) the 4-stage NAND circuit is faster.

Compound gates enable us to implement two-level logic in a compact fashion. Thus, we may implement an XOR gate using OAI or AOI compound gates. Figure 4.27 shows on the left an XOR circuit design based on an OAI gate. This circuit uses a NAND gate that behaves like the inverter of a 1-fork driving the OAI gate. This circuit is indeed slower than the more complex version on the right of Figure 4.27 that uses a 2-fork structure at the inputs and an AOI gate.

We analyze the AOI circuit using a *2d-analysis*,
by assigning effort delay \(d_1/2\) to the NAND gate and the
inverter in the upper leg and effort delay \(d_1+2\) to the lower
leg of the fork, and effort delay \(d_2\) to the AOI gate. Then,
the 2d-analysis permits expressing \(d_1\) as a function of
\(d_2.\) Minimizing the path effort delay by setting its
derivative w.r.t. \(d_1\) to zero yields polynomial

Given electrical effort \(H,\) the positive real roots of the polynomial yield minimum path delay \(\hat{D} = d_1 + d_2 + 16/3,\) which is plotted in Figure 4.28 as a function of \(H.\) We find that the AOI circuit has competitive delays across the range of \(H,\) but for no \(H\) is the AOI circuit the fastest XOR gate.

The comparison in Figure 4.28 suggests that the fastest XOR implementation depends on electrical effort \(H.\) In particular, among the studied alternatives, the 2-fork XOR circuit of Figure 4.22 is fastest for \(H<12,\) whereas for \(H \ge 12,\) the two-level NAND circuit of Figure 4.25 is fastest.

#### XNOR Gates¶

The XNOR gate is very similar to the XOR gate. Logically, the XNOR
gate produces the complement of the XOR gate. Thus, an \(n\)-input
XNOR gate implements the **even-parity** operation, which outputs 1 if
the number of inputs with value 1 is even. The truth table below
shows the XNOR function for \(n = 3,\) \(Y = \overline{A \oplus B
\oplus C},\) which equals 1 if the number of 1-inputs is 0 or 2.

| \(A\) | \(B\) | \(C\) | \(Y\) |
|---|---|---|---|
| 0 | 0 | 0 | 1 |
| 0 | 0 | 1 | 0 |
| 0 | 1 | 0 | 0 |
| 0 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 1 |
| 1 | 1 | 0 | 1 |
| 1 | 1 | 1 | 0 |
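A short sketch confirms the even-parity definition for arbitrary \(n\); the helper name `xnor` is hypothetical:

```python
from itertools import product

def xnor(*inputs):
    # n-input XNOR: complement of the XOR (odd parity) of all inputs
    parity = 0
    for x in inputs:
        parity ^= x
    return 1 - parity

# Y = 1 exactly when the number of 1-inputs is even (0 or 2 for n = 3)
for a, b, c in product((0, 1), repeat=3):
    assert xnor(a, b, c) == (1 if (a + b + c) % 2 == 0 else 0)
```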

The CMOS implementation of a 2-input XNOR gate is shown in Figure 4.29. Like the 2-input XOR gate, it consists of four arms. The pull-up and pull-down networks are not duals of each other. Noteworthy is the fact that XOR and XNOR CMOS gates have the same topology. In particular, we do not need an output inverter to implement the complement, as required for the AND and OR gates.

Figure 4.29: CMOS circuit for XNOR gate and interactive switch model.

Since the topology of the XNOR gate equals that of the XOR gate, both logical effort and parasitic delay of the XNOR gate are equal to those of the XOR gate:

\[g_{xnor2} = g_{xor2} = 4, \qquad p_{xnor2} = p_{xor2} = 4.\]

XNOR gates with more than two inputs can be constructed analogously to the XOR gate. The design issues of CMOS circuits for a 2-input XNOR logic gate are also analogous to those of the XOR logic gate.

### 4.1.6. Skewed Gates¶

When optimizing a circuit for speed, we may want the falling
transition of a signal from high to low voltage to be faster than the
rising transition from low to high voltage or vice versa. In this
section, we discuss **skewed gates** whose **critical transition** is
faster than the **noncritical transition**. In contrast, CMOS gates
with equal rising and falling delays are *unskewed* or *normal-skew*
gates. We distinguish between **HI-skew gates** and **LO-skew
gates**. In a HI-skew gate the rising output transition is the
faster, critical transition, and in LO-skew gates the falling output
transition is critical.

#### Design and Analysis¶

We can skew a gate by transistor sizing. For example, consider the matched 2-input NAND gate in Figure 4.30(a). By definition, the transistors of the matched NAND gate are sized to provide equal drive currents for the rising and falling transitions. Furthermore, the drive currents are equal to those of the reference inverter. Since the delay of a gate depends on its drive current, the rising and falling delays are equal if the magnitudes of the drive currents of the corresponding transitions are equal. The key insight for designing a skewed gate is to shrink the transistors in the CMOS network that drives the noncritical transition. For example, if we wish to speed up the rising transition of the NAND gate, we shrink the nMOS transistors of the pull-down network. The effect is that the pMOS transistors deliver the same drive current as the matched NAND gate on the rising transition, but smaller nMOS transistors reduce the logical effort of the gate, resulting in a faster transition. For example, the HI-skew NAND gate in Figure 4.30(b) halves the widths of the nMOS transistors compared to the matched NAND gate to speed up the rising transition.

To characterize the delay of the HI-skew NAND gate, we determine its logical effort and parasitic delay. However, due to the unmatched transistor sizes, logical effort and parasitic delay differ for the rising and falling transitions. We use \(g_u\) and \(p_u\) to denote the logical effort and parasitic delay of the rising output transition driven by the pull-up network, and \(g_d\) and \(p_d\) for the falling output transition driven by the pull-down network. Then the delays of the rising and falling transitions of a skewed gate are:

\[d_u = g_u\,h + p_u, \qquad d_d = g_d\,h + p_d,\]

where \(h\) is the electrical effort of the gate.

We consider the rising output transition first. The output of a 2-input NAND gate transitions from 0 to 1 if initially both inputs \(A\) and \(B\) are 1, and one of the inputs switches from 1 to 0. The corresponding pMOS transistor in the pull-up network drives the output. Which one of the two pMOS transistors switches does not matter in this example, because the NAND gate is symmetric. Since the width of the pMOS transistors of the HI-skew gate equals the width in the matched NAND gate, both gates produce the same drive current. Furthermore, because the matched NAND gate has the same pull-up drive current as the reference inverter, we conclude that the HI-skew gate has the same pull-up drive current as the reference inverter. Now, recall our definition of the logical effort of a CMOS gate as the ratio of its input capacitance to that of the reference inverter, assuming that the CMOS gate is sized to deliver the same drive current as the reference inverter. Since the drive currents are equal, the logical effort \(g_u\) of the rising transition of the HI-skew gate is the ratio of the input capacitance of the HI-skew gate and the input capacitance of the reference inverter. According to Figure 4.30(b), each input has input capacitance \(C_{in}(hi) = W_p + W_n = 3,\) because \(W_p = 2\) and \(W_n = 1,\) so that

\[g_u = \frac{C_{in}(hi)}{C_{in}(inv)} = \frac{3}{3} = 1.\]

Shrinking the pull-down transistors of the HI-skew gate retains the drive current through the pull-up network while reducing the input capacitance. As a result logical effort \(g_u = 1\) is less than the logical effort of the matched gate \(g_{nand2} = 4/3.\) Similarly, the parasitic delay of the rising transition of the HI-skew gate is

\[p_u = \frac{C_{out}(hi)}{C_{out}(inv)} = \frac{5}{3}.\]

We find that parasitic delay \(p_u\) is less than that of the matched gate \(p_{nand2} = 2.\)

Next, we determine \(g_d\) and \(p_d\) of the falling
transition of the HI-skew gate. To apply the definition of logical
effort, we construct the **down-scaled** NAND gate shown in
Figure 4.30(c). The down-scaled gate has the
same nMOS transistor widths as the HI-skew gate, so that their drive
currents are equal. We halve the width of the pull-up transistors, so
that the down-scaled gate is a scaled version of the matched NAND gate
with scale factor \(\gamma = 1/2.\) Now, we can argue that the
HI-skew NAND gate has the same pull-down drive current as the matched
NAND gate scaled by \(\gamma = 1/2,\) which in turn has the same
pull-down drive current as a reference inverter scaled by
\(\gamma = 1/2.\) Therefore, the logical effort of the falling
transition of the HI-skew gate is the ratio of the input capacitance
of the HI-skew gate and the input capacitance of the scaled reference
inverter:

\[g_d = \frac{C_{in}(hi)}{\gamma\,C_{in}(inv)} = \frac{3}{3/2} = 2\]

for each input. The cost of the reduced logical effort \(g_u = 1\) of the rising transition is an increased logical effort \(g_d = 2\) of the falling transition, compared to \(g_{nand2} = 4/3\) of the normal-skew gate. Analogously, the parasitic delay of the falling transition of the HI-skew gate is

\[p_d = \frac{C_{out}(hi)}{\gamma\,C_{out}(inv)} = \frac{5}{3/2} = \frac{10}{3}.\]

The rising and falling logical efforts and parasitic delays are related through scale factor \(\gamma\) of the down-scaled matched gate: \(g_u = \gamma g_d\) and \(p_u = \gamma p_d.\) Notice that scale factor \(\gamma = 1/2\) is also the factor by which we shrank the pull-down nMOS transistors of the HI-skew gate to begin with. This observation simplifies the design and analysis of a skewed gate based on the model of logical effort. Below is the procedure to design and analyze a HI-skew gate; the procedure for a LO-skew gate can be viewed as dual:

- Choose a scale factor \(\gamma\) for the nMOS transistors of the pull-down network of the matched gate.
- The logical effort \(g_u\) of the rising transition is the ratio of the input capacitance of the HI-skew gate and the reference inverter. The falling transition has logical effort \(g_d = g_u / \gamma.\)
- The parasitic delay \(p_u\) of the rising transition is the ratio of the output capacitance of the HI-skew gate and the reference inverter. The falling transition has parasitic delay \(p_d = p_u / \gamma.\)

To facilitate a comparison between HI-skew and normal-skew gates, we use the average of the rising and falling quantities:

\[g_{avg} = \frac{g_u + g_d}{2}, \qquad p_{avg} = \frac{p_u + p_d}{2}.\]

The average provides an accurate measure of the transition delay of a gate over time. When the gate is in operation as part of a larger circuit, half of all output transitions are rising and the other half are falling transitions. For the HI-skew NAND gate in Figure 4.30, the average logical effort is \(g_{hi} = 3/2\) and the average parasitic delay \(p_{hi} = 5/2.\) Thus, although our HI-skew NAND gate has a faster rising transition than the unskewed gate, both average logical effort and parasitic delay are larger than \(g_{nand2} = 4/3\) and \(p_{nand2} = 2\) of the unskewed gate.
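The three-step procedure can be sketched for the HI-skew NAND gate. The helper name is hypothetical, and the drain-width count for the output capacitance is an assumption consistent with the values derived above:

```python
from fractions import Fraction as F

def hi_skew(c_in_gate, c_out_gate, gamma, c_in_inv=3, c_out_inv=3):
    # procedure from the text: g_u and p_u are capacitance ratios against
    # the reference inverter; falling quantities follow from scale factor gamma
    g_u = F(c_in_gate, c_in_inv)
    p_u = F(c_out_gate, c_out_inv)
    g_d, p_d = g_u / gamma, p_u / gamma
    return g_u, g_d, (g_u + g_d) / 2, p_u, p_d, (p_u + p_d) / 2

# HI-skew 2-input NAND of Figure 4.30(b): input capacitance W_p + W_n = 3;
# assumed output node: 2 pMOS drains (width 2) + 1 nMOS drain (width 1) = 5
g_u, g_d, g_avg, p_u, p_d, p_avg = hi_skew(3, 5, F(1, 2))
print(g_u, g_d, g_avg)   # 1 2 3/2
print(p_u, p_d, p_avg)   # 5/3 10/3 5/2
```

The printed values reproduce \(g_{hi} = 3/2\) and \(p_{hi} = 5/2\) from the text.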

#### Fast Skewed Gates¶

Our observation that the HI-skew NAND gate has a larger average delay than the unskewed NAND raises the question whether a skewed gate exists that is faster than its unskewed version. This is the case indeed. The LO-skew 2-input NOR gate in Figure 4.31(b) has a smaller average logical effort and smaller average parasitic delay than the normal-skew NOR gate.

The LO-skew NOR speeds up the falling transition by shrinking the pull-up pMOS transistors. In Figure 4.31, we scale the pMOS transistors of the matched NOR gate with factor \(\gamma = 3/4\) to obtain transistor width \(W_p = 3.\) Since the nMOS transistors are unchanged compared to the matched NOR gate, the drive current of the falling transition through the pull-down network of the LO-skew gate is the same as the pull-down drive current of the matched NOR gate and the reference inverter. Therefore, the logical effort of the falling transition is \(g_d = C_{in}(lo) / C_{in}(inv) = 4/3.\) The logical effort of the rising transition is then \(g_u = g_d / \gamma = 16/9.\) Note that \(g_u\) is the ratio of the input capacitances of the LO-skew gate and the \(\gamma\)-scaled reference inverter, and \(\gamma=3/4\) is equal to the scale factor of the up-scaled matched NOR gate in Figure 4.31(c). The average logical effort of the LO-skew NOR gate is \(g_{lo} = 14/9,\) which is less than the logical effort \(g_{nor2} = 5/3\) of the normal-skew gate. Similarly, we find the parasitic delays \(p_d = 5/3\) and \(p_u = 20/9.\) The average parasitic delay of \(p_{lo} = 35/18\) units is less than the parasitic delay \(p_{nor2} = 2\) of the normal-skew NOR gate.

We conclude that for all electrical efforts \(h,\) the average delay of the LO-skew NOR gate, \(d_{lo}(h) = (14/9) h + 35/18,\) is slightly less than the delay of the normal-skew NOR gate, \(d_{nor2}(h) = (5/3) h + 2.\) We can formulate a minimization problem to determine scale factor \(\gamma\) for the LO-skew NOR gate such that the average delay assumes its minimum. Using calculus, we find that scale factor

\[\gamma = \sqrt{\frac{h + 2}{4\,(h + 1)}}\]

minimizes the average delay if it is chosen as a function of electrical effort \(h.\) For \(h \ge 1,\) the range of \(\gamma\) is very small, \(1/2 < \gamma \le \sqrt{3/8}.\) Therefore, choosing \(\gamma \approx 1/2\) will approximate the minimum average delay reasonably well for all \(h.\) The resulting LO-skew NOR gate has pMOS widths \(W_p = 4 \gamma = 2.\) This gate is slightly faster than the matched NOR gate, because its average logical effort \(g=3/2\) is less than \(g_{nor2} = 5/3\) and the average parasitic delay \(p=2\) is the same. Before you decide to use LO-skew NOR gates in your circuits, however, notice that the speed benefit compared to the normal-skew NOR is relatively small. Furthermore, in many circuits, it is not the average delay but the worst-case delay that limits the overall performance. The worst-case delay of the LO-skew NOR is the delay of the noncritical rising transition, which is larger than the delay of the matched NOR gate.
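A small sketch, assuming the matched 2-input NOR transistor widths \(W_p = 4\) and \(W_n = 1\) used in this section, reproduces the stated LO-skew NOR quantities as a function of \(\gamma\):

```python
from fractions import Fraction as F

def lo_skew_nor(gamma):
    # LO-skew 2-input NOR: pMOS widths of the matched gate (W_p = 4) scaled
    # by gamma, nMOS widths unchanged (W_n = 1); reference inverter caps = 3
    c_in = 4 * gamma + 1        # per-input capacitance
    c_out = 4 * gamma + 2       # assumed output node: 1 pMOS + 2 nMOS drains
    g_d, p_d = c_in / 3, c_out / 3
    g_u, p_u = g_d / gamma, p_d / gamma
    return (g_u + g_d) / 2, (p_u + p_d) / 2

print(lo_skew_nor(F(3, 4)))   # averages for gamma = 3/4: (14/9, 35/18)
print(lo_skew_nor(F(1, 2)))   # averages for gamma = 1/2: (3/2, 2)
```

The two evaluations match the averages \(g_{lo} = 14/9,\) \(p_{lo} = 35/18\) at \(\gamma = 3/4\) and \(g = 3/2,\) \(p = 2\) at \(\gamma = 1/2\) quoted in the text.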

#### Multistage Paths¶

The delay of an \(n\)-stage path with skewed gates depends on its output transition. Let index \(k,\) \(1 \le k \le n,\) denote the \(k^{th}\) gate from the last stage, then the path delays of the rising and falling output transition are:

\[D_u = \sum_{k\ \mathrm{odd}} \big(g_u(k)\,h_k + p_u(k)\big) + \sum_{k\ \mathrm{even}} \big(g_d(k)\,h_k + p_d(k)\big), \qquad
D_d = \sum_{k\ \mathrm{odd}} \big(g_d(k)\,h_k + p_d(k)\big) + \sum_{k\ \mathrm{even}} \big(g_u(k)\,h_k + p_u(k)\big).\]

Since CMOS gates implement monotonically decreasing functions, every second stage of the path transitions in one direction, and all other stages transition in the opposite direction. Therefore, if the output of gate 1 in the last stage rises, the outputs of all gates with odd index \(k\) rise, and the outputs of the gates with even \(k\) fall. Path delay \(D_u\) is the sum of the corresponding skewed gate delays. Analogously, if the output of the last gate of the path falls, the path delay is \(D_d.\) The average path delay is a function of the average logical efforts and parasitic delays:

\[D_{avg} = \frac{D_u + D_d}{2} = \sum_{k=1}^{n} \big(g_{avg}(k)\,h_k + p_{avg}(k)\big).\]

If we wish to minimize the average path delay, we apply the method of
logical effort for *multistage paths* as we
know it already, except that we use the average logical efforts and
parasitic delays of the skewed gates for path and gate sizing. The
design procedure is the same as for unskewed gates. With skewed gates
we have an additional design goal at our disposal, which is to
minimize the delay of the critical transition.

Consider the 3-stage path in Figure 4.32 with a HI-skew NAND gate in stage 1, a LO-skew NOR gate in stage 2 and a normal-skew inverter in stage 3. Given the transistor widths in Figure 4.32 and load capacitance \(C_L = 6,\) determine the path delay of the critical rising transition, the path delay of the noncritical falling transition, and the average path delay.

Analysis of the gate logical efforts and parasitic delays yields the tabulated values.

| | \(g_u\) | \(g_d\) | \(g_{avg}\) | \(p_u\) | \(p_d\) | \(p_{avg}\) |
|---|---|---|---|---|---|---|
| HI-skew NAND | 1 | 2 | 3/2 | 5/3 | 10/3 | 5/2 |
| LO-skew NOR | 2 | 1 | 3/2 | 8/3 | 4/3 | 2 |
| NOT | 1 | 1 | 1 | 1 | 1 | 1 |

The average path delay from input \(A\) or \(B\) to output \(Y\) is the sum of the average gate delays:

The minimum delay is the delay of the critical transition. The critical transition of the path is the rising output transition, which the path speeds up with the HI-skew NAND gate followed by the LO-skew NOR gate. The rising delay is

The maximum delay is the delay of the falling output transition

We find that the critical transition of the path is 2.5 delay units faster than the average delay, at the expense of the noncritical transition which is 2.5 delay units slower.

We close this section with a brief note on another practically
relevant application of skewed gates, namely the design of gates with
equal rising and falling delays when the ratio of the carrier mobilities
of the CMOS fabrication process is not \(\mu_n/\mu_p = 2.\)
Throughout this text, we assume that the mobility ratio is
\(\mu_n/\mu_p = 2.\) Value 2 is convenient for
back-of-the-envelope estimates, but is merely an approximation to
reality, where typical values are in range \(2 \lesssim
\mu_n/\mu_p \lesssim 3.\) In CMOS processes where \(\mu_n/\mu_p
\ne 2,\) we may skew the gates to equalize the rise and fall times.
The *model of logical effort* enables us to determine
the desired transistor widths, see Chapter 7 in [SSH99].

## 4.2. Comparators¶

Comparator circuits compare two \(n\)-bit signals \(A\) and
\(B.\) The *magnitude comparisons*, including \(A > B\) or
\(A \le B,\) are commonly implemented with an *arithmetic
circuit*. However, for two special cases,
equality comparison and equality to zero, simpler circuits exist.

### 4.2.1. Equality to Zero¶

Given an \(n\)-bit signal \(A,\) we wish to determine whether \(A = 0\) is true or false. Equality \(A = 0\) is true, if for each signal \(A_i,\) \(0 \le i < n,\) we have \(A_i = 0.\) Thus, we may use an \(n\)-input NOR gate to compute the equality to zero:

\[(A = 0) \quad\Leftrightarrow\quad \overline{A_{n-1} + A_{n-2} + \cdots + A_0} = 1.\]

Note that value \(0\) in equality \(A=0\) refers to an n-bit signal of 0’s, whereas the \(1\) in the NOR equality denotes a single bit, the output of the NOR gate.

For larger values of \(n,\) we may use a *tree-structured* NOR gate to minimize the delay. Figure 4.33 shows one possible tree-structure for \(n = 4,\) which
is due to the Boolean identity:

\[\overline{A_3 + A_2 + A_1 + A_0} = \overline{A_3 + A_2} \cdot \overline{A_1 + A_0}.\]

As for any tree structure, the fastest design for an \(n\)-bit equality-to-zero comparator depends on the number of inputs \(n.\)
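A sketch of the equality-to-zero comparison, assuming the 4-input tree is built from two 2-input NORs feeding a 2-input AND (hypothetical helper names):

```python
from itertools import product

def is_zero_flat(bits):
    # n-input NOR: 1 exactly when every bit of the signal is 0
    return 1 if not any(bits) else 0

def is_zero_tree4(a3, a2, a1, a0):
    # assumed tree for n = 4: two 2-input NORs combined by a 2-input AND
    nor_hi = 1 - (a3 | a2)
    nor_lo = 1 - (a1 | a0)
    return nor_hi & nor_lo

# the tree computes the same function as the flat 4-input NOR
for bits in product((0, 1), repeat=4):
    assert is_zero_tree4(*bits) == is_zero_flat(bits)
```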

### 4.2.2. Equality Comparator¶

Given two \(n\)-bit signals \(A\) and \(B,\) we wish to determine whether \(A = B\) is true or false. The equality-to-zero comparison may be viewed as a special case of the equality comparison where \(B\) is identical to zero. Equality \(A = B\) is true if for each bit position \(i\) we have \(A_i = B_i.\) More precisely:

\[A = B \quad\Leftrightarrow\quad \forall\, i \in [0, n-1]:\ A_i = B_i.\]

Recall that the *2-input XNOR gate* realizes the
equality relation, \(A_i = B_i\ \Leftrightarrow\ \overline{A_i
\oplus B_i} = 1.\) Thus, we may implement an equality comparator using
one 2-input XNOR gate for each bit position, and an \(n\)-input
AND gate to compute the universal quantification by means of a conjunction:

\[Y = \overline{A_{n-1} \oplus B_{n-1}} \cdot \overline{A_{n-2} \oplus B_{n-2}} \cdots \overline{A_0 \oplus B_0}.\]

Output \(Y = 1\) if \(A = B.\) For larger values of
\(n,\) we implement the AND gate as a *tree structure*. For example, Figure 4.34 shows a
4-bit equality comparator with a NAND-NOR tree implementation of the
4-input AND gate on the right.
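The bitwise construction can be sketched as follows (hypothetical helper `equal`):

```python
from itertools import product

def equal(a_bits, b_bits):
    # one 2-input XNOR per bit position, combined by an n-input AND
    y = 1
    for a, b in zip(a_bits, b_bits):
        y &= 1 - (a ^ b)          # 2-input XNOR of A_i and B_i
    return y

# exhaustive check against direct tuple comparison for n = 3
for a in product((0, 1), repeat=3):
    for b in product((0, 1), repeat=3):
        assert equal(a, b) == (1 if a == b else 0)
```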

## 4.3. Multiplexer¶

A **multiplexer**, or **mux** for short, is a circuit with \(n =
2^k\) data inputs \(D_i,\) \(0 \le i < n,\) and \(k\)
select inputs \(S_j,\) \(0 \le j < k,\) that choose one
of the \(n\) data inputs and steer it to output \(Y.\)

For example, a 2-input multiplexer, also called **2:1 mux** to
emphasize 2 inputs and 1 output, has \(n=2\) data inputs,
\(D_0\) and \(D_1,\) and \(k=1\) select input \(S.\)
Output \(Y\) is defined such that

\[Y = \begin{cases} D_0, & \text{if } S = 0, \\ D_1, & \text{if } S = 1. \end{cases}\]

The circuit symbol of the 2:1 mux signifies that select signal
\(S\) steers input \(D_0\) to output \(Y\) if \(S =
0,\) and otherwise input \(D_1.\) We can model the functional
behavior of a 2:1 multiplexer using a switch with two closed
positions. Note that this switch model differs from our
*standard model*, where a switch is either open or
closed.

To design a circuit for the 2:1 mux, we first construct a truth table. The 2:1 mux has three inputs, \(S,\) \(D_1,\) and \(D_0,\) and one output \(Y.\) We translate the specification directly into the truth table: if \(S=0,\) then set output \(Y = D_0\) (top four rows), else if \(S=1\) set \(Y = D_1\) (bottom four rows):

| \(S\) | \(D_1\) | \(D_0\) | \(Y\) |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 1 |
| 0 | 1 | 0 | 0 |
| 0 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 |
| 1 | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 |

Output \(Y\) of this truth table is represented by the Boolean expression:

\[Y = \overline{S}\,D_0 + S\,D_1.\]

This expression suggests an implementation of the 2:1 mux using logic gates as shown on the right. The critical path of this circuit is a 3-stage path from input \(S\) through the inverter, one AND gate and the OR gate to output \(Y.\)
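A sketch verifying the 2:1 mux against its truth table, assuming the SOP expression \(Y = \overline{S}\,D_0 + S\,D_1\) (the helper name is hypothetical):

```python
from itertools import product

def mux21(s, d1, d0):
    # SOP expression derived from the truth table: Y = ~S*D0 + S*D1
    return ((1 - s) & d0) | (s & d1)

# exhaustive check: S steers D0 to Y if S = 0, and D1 otherwise
for s, d1, d0 in product((0, 1), repeat=3):
    assert mux21(s, d1, d0) == (d0 if s == 0 else d1)
```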

A faster implementation of the 2:1 mux can be based on a
*compound gate*, or with the CMOS circuit shown
in Figure 4.35. Since the circuit produces the
complemented output \(Y = \overline{\overline{S} D_0 + S D_1}\) it
implements an *inverting 2:1 mux*. Each data input drives one vertical
**select arm** consisting of two nMOS and two pMOS transistors. The
complemented and uncomplemented select inputs enable one of the select
arms to drive output \(Y.\) If select input \(S=0,\) the left
arm is enabled, and if \(S=1\) the right arm. The output value
depends on the corresponding data input, \(D_0\) for the left arm
and \(D_1\) for the right arm.

Figure 4.35: CMOS circuit for inverting 2:1 mux and interactive switch model.

The inverting 2:1 multiplexer in Figure 4.35 has
a remarkably flexible structure that extends to \(n\) data inputs,
\(n > 0,\) even if \(n\) is not a power of 2. If we are
willing to generate one select signal per data input, e.g. by means of
a decoder, we can replicate the select arm \(n\) times to
construct the \(n\)-way multiplexer in Figure 4.36.
Select arm \(i\) drives the complement of \(D_i\) onto output
\(Y,\) if select input \(S_i = 1.\) All other select signals
\(S_j,\) \(j \ne i,\) must be \(S_j = 0.\) Otherwise, two
enabled select arms might drive different values on output \(Y,\)
effectively short circuiting \(V_{DD}\) and *GND*.

Since each select arm consists of two series pMOS transistors in the pull-up network and two series nMOS transistors in the pull-down network, the matched \(n\)-way multiplexer has pMOS width \(W_p = 4\) and nMOS width \(W_n = 2.\) Therefore, the logical effort of each data input is \(g_{mux}(D_i) = 2,\) independent of the number of inputs. This property is a unique feature of the multiplexer circuit. Unfortunately, the parasitic delay \(p_{mux}(n) = 2 n\) grows proportionally to the number of inputs \(n.\) Nevertheless, we note that the effort delay of the multiplexer circuit is independent of its fan-in.

For larger numbers of inputs \(n,\) the parasitic delay of the
\(n\)-way circuit in Figure 4.36 may dominate the total
delay. In this case, *tree-structured* multiplexers
provide a fast alternative. As an example, consider a 4:1 multiplexer
with four data inputs and two select inputs, as shown on the left in
Figure 4.37. The 4:1 multiplexer steers data input
\(D_i\) to output \(Y\) if the select signal equals \(i\)
in binary format. Note that \(k\) select inputs enable us to
select one of \(n = 2^k\) data inputs, because a binary number
with \(k\) bits can represent unsigned numbers in range \([0,
2^k-1].\) In particular, each of the \(k\) select signals may be
used as select input for one of \(k\) levels in a binary tree of
2:1 muxes with \(n = 2^k\) data inputs. The multiplexer tree on
the right in Figure 4.37 has \(4 = 2^k\) inputs and
\(k=2\) levels.

The functionality of the 4:1 multiplexer is easy to verify by perfect induction. For example, if \(S = 01,\) i.e. \(S_1 = 0\) and \(S_0 = 1,\) then the level-1 output mux steers its 0-input to \(Y,\) because \(S_1 = 0.\) The 0-input of the level-1 mux is driven by the top level-0 mux, which steers \(D_1\) to its output, because \(S_0 = 1.\) Therefore output \(Y = D_1\) for \(S = 01.\) We argue analogously about the other cases of the perfect induction:

| \(S_1\) | \(S_0\) | \(Y\) |
|---|---|---|
| 0 | 0 | \(D_0\) |
| 0 | 1 | \(D_1\) |
| 1 | 0 | \(D_2\) |
| 1 | 1 | \(D_3\) |
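
The perfect induction is mechanical, so we can let a short script carry it out. The following Python sketch (function names are ours) models the 2:1 mux building block and the two-level tree, and checks every select value against the table:

```python
def mux2(s, d0, d1):
    """2:1 multiplexer: steer d1 to the output if s == 1, else d0."""
    return d1 if s else d0

def mux4_tree(s1, s0, d):
    """4:1 tree mux: two level-0 muxes selected by s0, one level-1 mux by s1."""
    low = mux2(s0, d[0], d[1])    # level 0, drives the 0-input of level 1
    high = mux2(s0, d[2], d[3])   # level 0, drives the 1-input of level 1
    return mux2(s1, low, high)

# Perfect induction: for every select value i and every data pattern,
# the output must equal data input D_i.
for i in range(4):
    s1, s0 = (i >> 1) & 1, i & 1
    for bits in range(16):
        d = [(bits >> j) & 1 for j in range(4)]
        assert mux4_tree(s1, s0, d) == d[i]
```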

A delay analysis reveals that a \(4\)-way mux with independent select arms has a delay of \(D_{4way} = 2 H + 8\) per data input, whereas the 4:1 tree mux with two 2-way muxes on its critical path has a minimum delay of \(\hat{D}_{tree} = 4 \sqrt{H} + 8\) delay units per data input. Thus, the tree mux is faster than the 4-way circuit for electrical effort \(H > 4.\) This delay analysis neglects the fact that the 4-way mux circuit is inverting whereas the tree mux is not. We may also design tree-structured multiplexers with 4-way or 8-way muxes as building blocks. Due to the flexibility of the \(n\)-way mux circuit, the design space for tree-structured muxes is quite large. It has been shown, however, that 4-way muxes as building blocks generally produce the fastest tree structures for \(n \ge 8\) data inputs. Chapter 11 of [SSH99] studies tree structures in more detail.
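
The break-even point follows directly from the two delay formulas. A small Python check (function names are ours) confirms that the crossover lies at electrical effort \(H = 4\):

```python
import math

def d_4way(H):
    """Delay of the 4-way mux per data input: effort delay 2H plus parasitic 8."""
    return 2 * H + 8

def d_tree(H):
    """Minimum delay of the 4:1 tree mux: two stages of logical effort 2 each,
    optimally sized, give 2 * sqrt(4H) = 4 * sqrt(H), plus parasitic delay 8."""
    return 4 * math.sqrt(H) + 8

assert d_tree(4) == d_4way(4)      # break-even point at H = 4
assert d_tree(16) < d_4way(16)     # tree mux is faster for large H
assert d_tree(1) > d_4way(1)       # 4-way mux is faster for small H
```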

## 4.4. Tristate Inverter and Transmission Gates¶

A single select arm of the multiplexer circuit in Figure 4.36
is a **tristate inverter**. The name *tristate* refers to the fact
that the output can assume a third state in addition to the usual
binary states 0 and 1. In the third state, output \(Y\) of the
tristate inverter is connected neither to \(V_{DD}\) nor *GND*.
Therefore, the output has an undetermined voltage. We say the output
**floats**, and denote this third state with letter **Z**.
Figure 4.38 shows that the output of the
select arm floats if select signal \(S = 0.\) Otherwise, if
\(S = 1,\) the arm functions like an inverter, i.e. \(Y =
\overline{D}.\)

Figure 4.38: Interactive switch model of one multiplexer select arm.

In a multiplexer circuit, all but one select arm float. The selected arm does not float, and drives output \(Y.\) In a figurative sense, the floating arms shut up, leaving the word to the only selected arm. In an \(n\)-way multiplexer with \(n = 2^k\) arms, there is always one selected arm, and the output never floats. In general, this is the behavior we expect from a multiplexer, and an implementation with logic gates produces it as well. Tristate inverters expand the applicability of select arms, for example as drivers for shared buses where all arms may float.

To arrive at the traditional circuit realization of a tristate inverter, we play the following circuit trick. Note that we can introduce a new wire in the select arm, shown in Figure 4.39(b), without affecting its functionality. Now, imagine we could pull output \(Y\) to the right like a rubber band. Then, we obtain the topologically equivalent circuit in Figure 4.39(c).

The parallel composition of an nMOS and pMOS transistor in
Figure 4.39(c) is a **transmission gate**
with the function:

\[Y = \begin{cases} X & \text{if}\ S = 1, \\ Z & \text{if}\ S = 0. \end{cases}\]

The transmission gate disconnects its two terminals, if \(S=0,\) and connects its two terminals if \(S = 1.\) In Figure 4.39(c), terminal \(X\) is driven by an inverter, such that \(X = \overline{D}.\) If \(S = 0,\) output \(Y\) is disconnected from \(X,\) and floats. Otherwise, if \(S=1,\) output \(Y\) is connected to terminal \(X,\) and the inverter drives output \(Y = \overline{D}\) through the transmission gate.

This circuit is called a tristate inverter, and is commonly defined as a
2-input gate, with a distinguished *enable* input. The symbol of the
tristate inverter is shown on the left. It assumes that the enable
signal is complemented internally. The function of the tristate
inverter is defined as

\[Y = \begin{cases} \overline{A} & \text{if}\ EN = 1, \\ Z & \text{if}\ EN = 0. \end{cases}\]
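
The three-valued behavior is easy to model in software. The sketch below (a behavioral abstraction, not a circuit model; names are ours) treats Z as a distinct value and shows how one enabled arm and several floating arms share a bus:

```python
Z = "Z"  # floating (high-impedance) output state

def tristate_inv(en, a):
    """Tristate inverter: Y = complement of A when enabled, else Y floats."""
    if en == 0:
        return Z
    return 1 - a

def bus(drivers):
    """Shared bus: exactly one driver must be enabled, the rest must float."""
    driven = [v for v in drivers if v != Z]
    assert len(driven) == 1, "bus conflict or floating bus"
    return driven[0]

# One enabled arm drives the bus; the other arms float.
y = bus([tristate_inv(0, 1), tristate_inv(1, 1), tristate_inv(0, 0)])
assert y == 0   # the enabled arm inverts its input 1
```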

To analyze the logical effort of the tristate inverter in
Figure 4.39(c), we need to understand the
pass characteristics of the transmission gate. To that end, we
introduce a refinement of our simple transistor *switch model*. The key to understanding the transmission gate is
the **threshold voltage** \(V_t,\) which is a characteristic
process parameter of a transistor. For today’s nMOS transistors,
\(V_{tn} = +V_t\) lies in the range \(V_{DD}/4 \lesssim V_t
\lesssim V_{DD}/3,\) and for pMOS transistors \(V_{tp} \approx
-V_t\) has opposite polarity.

An nMOS transistor is switched off if the voltage between gate and
source is smaller than the threshold voltage, \(V_{gs} < V_t.\)
This cut-off effect influences the current between source and drain,
\(I_{ds},\) as well as the terminal voltages in a more
differentiated fashion than our *simple switch model* suggests. Figure 4.40 shows
two cases where the source terminal of the nMOS transistor is
connected to *GND* (left), and where the drain terminal is connected
to \(V_{DD}\) (right). We assume the gate voltage of the nMOS
transistor equals \(V_g = V_{DD}.\) In our simple switch model
\(V_g = V_{DD},\) i.e. g = 1 in the digital abstraction, closes
the nMOS transistor. Now, consider the case where the source is tied
to *GND*, \(V_s = 0,\) see Figure 4.40 on
the left. The voltage between gate and source is \(V_{gs} = V_g
- V_s = V_{DD},\) which is larger than threshold voltage \(V_t.\)
The current \(I_{ds}\) flowing from drain to source depends on
drain voltage \(V_d\) and on-resistance \(R_n\) of the
transistor, such that \(I_{ds} = V_d / R_n\) according to our
*transistor RC model*. Once the drain capacitance is discharged and
\(I_{ds} = 0,\) the transistor has pulled the drain voltage to ground,
i.e. \(V_d = V_{ds} = 0.\) This is the behavior of an nMOS transistor that we
know from the pull-down network of the CMOS inverter, for example.

In Figure 4.40 on the right, both gate and
drain terminals of the nMOS transistor are tied to \(V_{DD},\) so
that \(V_g = V_d = V_{DD}.\) In this circuit, current
\(I_{ds}\) depends on source voltage \(V_s.\) Since the
transistor is switched on only if \(V_{gs} \ge V_t,\) the source
voltage of the closed transistor cannot rise to \(V_{DD}\) without
external force. Instead, \(V_{gs} \ge V_t\) implies \(V_{ds}
\ge V_t\) by *KVL*, so that \(V_s \le V_{DD} - V_t.\)
The source-drain voltage suffers the so-called **threshold drop**,
which effectively reduces on-current \(I_{ds}.\) In the digital
abstraction we say that the nMOS transistor pulls the source terminal
to a **weak 1** at the drain, because the maximum source voltage is by
\(V_t\) smaller than drain voltage \(V_{DD}.\) In contrast,
shown in Figure 4.40 on the left, the nMOS
transistor pulls the drain terminal to a **strong 0** at the source,
because the minimum drain voltage equals source voltage 0. This
behavior is the reason why we use nMOS transistors in pull-down but
not in pull-up networks of CMOS circuits.

The analogous effect occurs in pMOS transistors, where all polarities
are reversed. The threshold voltage of a pMOS transistor is negative,
so that a pMOS transistor is closed if \(V_{gs} \le -V_t,\) and is
open if \(0 \ge V_{gs} > -V_t.\) The pMOS transistor pulls the
source terminal to a **weak 0** at the drain, because the minimum
source voltage is by \(V_t\) larger than \(V_d = 0,\) see
Figure 4.40 on the left. However, a pMOS
transistor pulls the drain terminal to a **strong 1** at the source,
because the maximum drain voltage is \(V_{DD}\) if \(V_{gs} =
-V_{DD},\) which is smaller than \(-V_t,\) see Figure 4.40 on the right. Thus, pMOS transistors are suited for
pull-up but not for pull-down networks of CMOS circuits.

Figure 4.41 shows the **pass transistor model** as
a convenient abstraction of the refined switch model. We say that an
nMOS transistor passes a strong 0 from input (source) to output
(drain), but a weak 1. Analogously, a pMOS transistor passes a
strong 1 from input (drain) to output (source), but a weak 0. Since
transistors are symmetric, source and drain can be used
interchangeably as inputs or outputs. According to the pass
transistor model we can use either an nMOS or a pMOS transistor as a
switch to pass an input signal to the output. However, neither passes
both inputs 0 and 1 strongly to the output. The transmission gate
uses a parallel composition of nMOS and pMOS transistors, so that one
of the transistors passes the input strongly, see
Figure 4.42.

When a transistor passes a strong 0 or 1, it acts like a closed switch
in the *RC model*. Current \(I_{ds}\) is
determined by the on-resistance. However, when passing a weak 0 or 1,
current \(I_{ds}\) is effectively smaller than passing a strong 0
or 1. We model this behavior by means of an increased on-resistance.
As a reasonable estimate, we assume that the on-resistance is twice as
large when passing a weak versus a strong input. Then, given
on-resistances \(R_n\) and \(R_p\) of the simple switch model,
we define the strong and weak on-resistances of the nMOS transistor as
\(R_n\) and \(2 R_n,\) and of the pMOS transistor as \(R_p\) and
\(2 R_p,\) respectively.

The on-resistances enable us to model the transmission gate as a resistive switch circuit, as shown in Figure 4.43.

Given unit-sized transistors, we have \(R_p = 2 R_n,\) because of mobility ratio \(\mu_n / \mu_p = 2.\) Then the effective on-resistance for passing 0 is the parallel combination of the strong nMOS and weak pMOS resistances:

\[R_{pass0} = R_n \parallel 2 R_p = \frac{R_n \cdot 4 R_n}{R_n + 4 R_n} = \frac{4}{5}\,R_n,\]

and for passing 1, of the weak nMOS and strong pMOS resistances:

\[R_{pass1} = 2 R_n \parallel R_p = \frac{2 R_n \cdot 2 R_n}{2 R_n + 2 R_n} = R_n.\]
Since the difference is small, we approximate the on-resistance of the transmission gate with unit-sized transistors to be \(R_n\) in both cases when passing 0 or 1. With this approximation, we arrive at the matched tristate inverter shown on the right. The equivalent tristate inverter circuit in Figure 4.39(c) consists of a series composition of an inverter and a transmission gate. By doubling the size of both, the tristate inverter has the same pull-up and pull-down drive currents as the reference inverter. Thus, the logical effort of the tristate inverter is \(g_{tri}(A) = 2\) for input \(A,\) and \(g_{tri}(EN) = g_{tri}(\overline{EN}) = 2/3\) for the complemented and uncomplemented enable inputs. The parasitic delay is \(p_{tri} = 4/3.\)
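
The on-resistance estimate is easy to verify numerically. The Python sketch below (names are ours) computes the parallel combinations for passing 0 and 1 with unit-sized transistors:

```python
def parallel(r1, r2):
    """Resistance of two resistors in parallel."""
    return r1 * r2 / (r1 + r2)

R_n = 1.0          # nMOS on-resistance (unit)
R_p = 2 * R_n      # pMOS on-resistance, from mobility ratio mu_n / mu_p = 2

# Weak passing doubles the on-resistance in our estimate.
r_pass0 = parallel(R_n, 2 * R_p)   # nMOS passes 0 strongly, pMOS weakly
r_pass1 = parallel(2 * R_n, R_p)   # pMOS passes 1 strongly, nMOS weakly

assert abs(r_pass0 - 0.8 * R_n) < 1e-12   # (4/5) R_n when passing 0
assert abs(r_pass1 - 1.0 * R_n) < 1e-12   # R_n when passing 1
```

The two values differ by only 20 %, which justifies approximating the on-resistance of the transmission gate as \(R_n\) in both cases.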

## 4.5. Encoder¶

In the broad sense, an encoder is a circuit that transforms its inputs
into a code word of a given code. In the narrow sense of digital
circuits, an *encoder* commonly denotes a circuit that transforms a
one-hot coded word into a binary coded word. In this section, we
discuss the one-hot to binary encoder and another useful circuit, the
*priority encoder*.

### 4.5.1. One-Hot to Binary Encoder¶

The **one-hot code** with \(n\) bits \(x_i,\) where \(0
\le i < n,\) sets bit \(x_i = 1\) in code word \(X(i),\) and
all other bits \(x_j = 0\) for \(j \ne i.\) For example, the
one-hot code with \(n=4\) bits defines 4 code words \(X(i) =
x_3 x_2 x_1 x_0\) for \(0 \le i < 4\):

| \(i\) | \(x_3\) | \(x_2\) | \(x_1\) | \(x_0\) |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 |
| 2 | 0 | 1 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 |

The one-hot to binary **encoder** is a circuit with \(2^n\) inputs
\(A_i,\) where \(0 \le i < 2^n,\) and \(n\) outputs
\(Y_j,\) where \(0 \le j < n,\) such that output \(Y = k\)
in binary format if input \(A_k = 1.\) For example, this truth
table defines a 4:2 encoder for \(n=2\):

| \(A_3\) | \(A_2\) | \(A_1\) | \(A_0\) | \(Y_1\) | \(Y_0\) |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 |
| 0 | 0 | 1 | 0 | 0 | 1 |
| 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 1 | 1 |

Figure 4.44 shows two alternative circuits for the 4:2 encoder.

When building larger encoders, each of the \(n\) outputs requires an OR or NOR gate with \(2^{n-1}\) inputs. The load presented to the inputs of the encoder differs substantially, between driving just one and driving \(n\) output gates. For example, a 16:4 encoder based on OR gates has the output equations:

\[\begin{eqnarray*} Y_0 & = & A_1 + A_3 + A_5 + A_7 + A_9 + A_{11} + A_{13} + A_{15} \\ Y_1 & = & A_2 + A_3 + A_6 + A_7 + A_{10} + A_{11} + A_{14} + A_{15} \\ Y_2 & = & A_4 + A_5 + A_6 + A_7 + A_{12} + A_{13} + A_{14} + A_{15} \\ Y_3 & = & A_8 + A_9 + A_{10} + A_{11} + A_{12} + A_{13} + A_{14} + A_{15} \end{eqnarray*}\]

Input \(A_1\) drives only one OR gate of output \(Y_0.\) In contrast, input \(A_{15}\) is connected to all four OR gates. Therefore, the electrical effort of \(A_{15}\) is 4 times larger, and signal changes on input \(A_{15}\) suffer a larger delay than on input \(A_1.\) The method of logical effort enables us to assess whether we can speed up the slow inputs, for example by inserting buffers. However, rather than equalizing the delays of the encoder inputs, we prefer holding the driving circuit responsible for coping with the various load capacitances presented by the encoder inputs.
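
The fan-out asymmetry is easy to tabulate. The Python sketch below (names are ours) derives the OR terms of a 16:4 encoder from the bit positions of the input indices and counts the gates each input drives:

```python
n = 4   # number of outputs; the encoder has 2**n = 16 inputs

# Output Y_j ORs all inputs A_i whose index i has bit j set.
terms = {j: [i for i in range(1, 2 ** n) if (i >> j) & 1] for j in range(n)}

# Each output gate has 2**(n-1) = 8 inputs.
assert all(len(t) == 8 for t in terms.values())

# Fan-out per input: A_1 drives one output gate, A_15 drives all four.
fanout = {i: sum(1 for t in terms.values() if i in t) for i in range(1, 2 ** n)}
assert fanout[1] == 1
assert fanout[15] == 4
```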

### 4.5.2. Priority Encoder¶

An \(n\)-bit **priority encoder** is a circuit with \(n\) inputs
\(A_i\) and \(n\) outputs \(Y_i,\) where \(0 \le i < n,\) such that

\[Y_i = \overline{A_0}\,\overline{A_1} \cdots \overline{A_{i-1}}\,A_i.\]

Informally, a priority encoder identifies the first in the sequence of input bits with value 1. For example, here is the truth table of a 3-bit priority encoder:

| \(A_0\) | \(A_1\) | \(A_2\) | \(Y_0\) | \(Y_1\) | \(Y_2\) |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 | 0 | 1 |
| 0 | 1 | 0 | 0 | 1 | 0 |
| 0 | 1 | 1 | 0 | 1 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 1 | 1 | 0 | 0 |
| 1 | 1 | 0 | 1 | 0 | 0 |
| 1 | 1 | 1 | 1 | 0 | 0 |

The inputs \(A_i\) are ordered according to their index \(i.\) Output \(Y_i = 1,\) if \(A_0 = 0\) and \(A_1 = 0\) up to \(A_{i-1} = 0\) and input \(A_i = 1.\) The values of the remaining inputs \(A_{i+1}, \ldots, A_{n-1}\) do not matter for output \(Y_i.\) The circuit on the left of Figure 4.45 implements this logic for output \(Y_i\) with an \((i+1)\)-input AND gate, except for \(Y_0.\)
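
As a behavioral reference, the priority encoding rule can be stated in a few lines of Python (function name is ours):

```python
def priority_encode(a):
    """n-bit priority encoder: Y_i = 1 for the first input A_i with value 1."""
    y = [0] * len(a)
    for i, bit in enumerate(a):
        if bit == 1:
            y[i] = 1   # first 1 wins ...
            break      # ... all later inputs are don't-cares
    return y

# Spot checks against the 3-bit truth table.
assert priority_encode([0, 0, 0]) == [0, 0, 0]
assert priority_encode([0, 1, 1]) == [0, 1, 0]
assert priority_encode([1, 1, 1]) == [1, 0, 0]
```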

The daisy chain circuit on the right of Figure 4.45 uses less area than the AND gate circuit on the left, at the
expense of a larger propagation delay. As a rough estimate, let us
count inverters and 2-input gates to assess the area requirements of
the two circuits. Given 2-input gates only,
we implement all larger AND gates of the circuit on the left as
*tree gates*. Furthermore, assume that we make the
complemented and uncomplemented inputs available with 2-forks,
resulting in three inverters per input, for a total of \(3 n\)
inverters for \(n\) 2-forks. A tree-structured AND-gate with
\(k\) inputs requires approximately \(k/2 + (k/2-1) \approx
k\) 2-input NAND gates and the same number of inverters. Thus, all of
the AND gates of an \(n\)-bit priority encoder together require
roughly

\[\sum_{i=1}^{n-1} (i+1) = \frac{(n-1)(n+2)}{2} \approx \frac{n^2}{2}\]

2-input NAND gates and inverters, respectively. Therefore, the total number of gates of the priority encoder on the left in Figure 4.45 is approximately \(n^2/2\) 2-input NAND gates and \(n^2/2 + 3 n\) inverters. Asymptotically, for large \(n,\) the number of gates grows quadratically, a scary fact. But, the circuit can be quite fast, because the critical path is proportional to the height of the largest AND gate of output \(Y_{n-1},\) which consists of about \(\lg n\) levels of 2-input AND gates only.

The daisy chain circuit on the right of Figure 4.45 uses only a linear number of gates. Each inner stage of the chain consists of two NAND gates and three inverters. The first stage needs only one inverter and the last stage one NAND gate and one inverter. Therefore, the total number of gates of an \(n\)-bit priority encoder with daisy chain structure is \(2 (n-2) + 1 = 2 n - 3\) 2-input NAND gates and \(3 (n-2) + 1 = 3 n - 5\) inverters. For large \(n,\) the daisy chain occupies significantly less area than the AND gate design. The price we pay for the area efficiency is the propagation delay of the critical path, which stretches from input \(A_0\) to output \(Y_{n-1}.\) There are \(n-1\) NAND gates and \(n\) inverters on this path. A path with \(2n-1\) stages requires a path effort of about \(3.59^{2n-1}\) to be optimally sized. For large \(n,\) this path effort would be humongous. In all likelihood, practical designs will have to settle on a suboptimal delay that is dominated by the gates on the critical path.
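
The two gate-count estimates can be compared directly. The Python sketch below (function names are ours) uses the approximate count \(n^2/2\) for the AND-gate design and the exact counts for the daisy chain:

```python
def and_design(n):
    """Approximate gate counts of the AND-gate priority encoder (tree gates)."""
    nands = n * n // 2              # ~n^2/2 2-input NAND gates
    invs = n * n // 2 + 3 * n       # plus 3n inverters for the input 2-forks
    return nands, invs

def daisy_chain(n):
    """Exact gate counts of the daisy-chain priority encoder."""
    return 2 * n - 3, 3 * n - 5     # 2-input NAND gates, inverters

# The daisy chain grows linearly, the AND-gate design quadratically.
assert daisy_chain(8) == (13, 19)
assert and_design(64)[0] > 10 * daisy_chain(64)[0]
```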

If we wish to design a wide priority encoder with a large number of inputs and outputs \(n,\) then the AND gate design is not an option because of its quadratic area requirements, and the daisy chain would be slow. Thus, we may want to reduce the number of gates on the chain. We can halve the number of gates by using the inverting chain in Figure 4.46. The NAND gate in odd stage \(2i-1\) of the chain computes the complement \(\overline{\overline{a_0} \cdot \overline{a_1} \cdot \ldots \cdot \overline{a}_{2i-1}}\) and the NOR gate in even stage \(2i\) of the chain computes the uncomplemented conjunction \(\overline{a_0}\cdot \overline{a_1} \cdot \ldots \cdot \overline{a}_{2i}\) of the complemented input signals.

We wish to size the gates of the 7-bit priority encoder in Figure 4.46 in order to minimize its delay. We proceed in two steps. First, we obtain a rough estimate for the delay of the chain assuming fixed gate sizes, and, second, we size the gates on the daisy chain according to the method of logical effort.

To get a feel for the delay of the daisy chain assume the load
capacitance of each output is \(C_L(Y_i) = 36.\) Furthermore,
all gates shall be scaled by factor \(\gamma = 2.\) Thus, each
inverter has input capacitance \(C_{inv} = 6,\) each NAND gate
input presents capacitance \(C_{nand} = 8\) and each NOR gate
input \(C_{nor} = 10.\) Our path of interest is the critical
path from input \(A_0\) to output \(Y_6.\) The logical
effort of the path with 1 inverter, 3 NAND gates, and 3 NOR gates
is \(G = (4/3)^3 (5/3)^3 = 10.97.\) The branching effort of
all stages except the last two is \(b = 2,\) for a path branching
effort of \(B = 2^5 = 32.\) The electrical effort is \(H =
C_L/C_{in} = 36/6 = 6.\) Thus, the path effort of the critical path
is \(F = GBH = 2{,}107.\) According to the *path sizing
calculator*, the best number of stages of
an inverter chain for this path effort is \(n^\star = 6.\)
Thus, the number of stages on the critical path of our daisy chain,
\(n=7,\) is by no means unreasonable. The delay computation of
our priority encoder design is detailed in
Table 4.1.

| \(\mbox{stage}\) | \(C_{in}\) | \(C_{out}\) | \(h\) | \(g\) | \(f\) | \(p\) |
|---|---|---|---|---|---|---|
| 1 | 6 | 16 | 8/3 | 1 | 2.67 | 1 |
| 2 | 8 | 20 | 5/2 | 4/3 | 3.33 | 2 |
| 3 | 10 | 16 | 8/5 | 5/3 | 2.67 | 2 |
| 4 | 8 | 20 | 5/2 | 4/3 | 3.33 | 2 |
| 5 | 10 | 16 | 8/5 | 5/3 | 2.67 | 2 |
| 6 | 8 | 10 | 5/4 | 4/3 | 1.67 | 2 |
| 7 | 10 | 36 | 3.6 | 5/3 | 6 | 2 |
| total | | | | | 22.33 | 13 |

Table 4.1 suggests obvious improvements to our circuit. To obtain minimum delay, all stages should bear the same effort delay. Stages 6 and 7 present outliers with \(f_7=6\) and \(f_6 = 1.67\) deviating the most from the average \(22.33/7=3.2.\) We can fix the imbalances between the stage efforts by gate sizing. To minimize the delay of the chain, we should spread path effort \(F = 2{,}107\) across the \(n=7\) stages of the chain. Thus, with an effort delay of \(\hat{f} = 2{,}107^{1/7} = 2.98\) per stage, the total effort delay would be \(n \hat{f} = 20.9\) time units. The resulting improvement compared to our initial design is less than \(7\,\%.\) However, we may even do better if we also resize the output gates.
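
The numbers in this analysis are easy to reproduce. The Python sketch below (variable names are ours) recomputes the path effort, the best stage effort, and the resulting minimum delay of the 7-stage chain:

```python
g = [1, 4/3, 5/3, 4/3, 5/3, 4/3, 5/3]   # stage gate efforts on path A_0 -> Y_6
G = 1.0
for gi in g:
    G *= gi                  # path logical effort, (4/3)^3 (5/3)^3 = 10.97

B = 2 ** 5                   # branching effort b = 2 in five stages
H = 36 / 6                   # electrical effort C_L / C_in
F = G * B * H                # path effort, about 2107

f_hat = F ** (1 / 7)         # best stage effort for the 7-stage chain, ~2.98
P = 1 + 6 * 2                # total parasitic delay: 1 inverter, six 2-input gates
D_min = 7 * f_hat + P        # minimum path delay, about 33.9 time units

assert abs(F - 2107) < 1
assert abs(f_hat - 2.98) < 0.005
```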

The lesson we have learned from branching circuits is that the delay on the path of interest can be reduced either by decreasing the off-path capacitance or by increasing the on-path capacitance or both. The method of logical effort also teaches us that each stage of a fast design should bear effort \(f^\star = 3.59.\) If the NOR gates that drive off-path outputs \(Y_2\) and \(Y_4\) bear best stage effort \(f^\star = g h = (5/3) (36/C_{in}(nor)),\) they should have input capacitance \(C_{in}(nor) = 16.7.\) This is almost twice our initial assumption, and increases the branching effort compared to our initial design. We lower the branching effort of the path of interest by reducing the NOR gate sizes of outputs \(Y_2\) and \(Y_4\) from the original value of \(C_{in}(nor) = 10\) slightly to \(C_{in}(nor) = 8.\) The off-path AND gates, i.e. the NAND-INV pairs, that drive outputs \(Y_1,\) \(Y_3,\) and \(Y_5\) have two stages each. Thus, if both NAND gate and inverter bear best stage effort \(f^\star,\) then the NAND gates can have input capacitance \(C_{in}(nand) = 3.72.\) If we use minimum sized NAND gates with \(C_{in}(nand) = 4,\) we can reduce the branching effort of the path. As a result, we obtain the critical path of the priority encoder shown below, where the input capacitances \(C_2, \ldots, C_7\) of the NAND and NOR gates are yet to be determined:

In general, to obtain the minimum delay for such a problem, we need to iterate the gate sizing. For the first iteration, we proceed as follows. We assume that the branching effort in each stage is \(b = 2.\) Then, we know path effort \(F=2{,}107\) already, and the stage effort for minimum delay should be \(\hat{f} = 2.98,\) which is reasonably close to \(f^\star = 3.59,\) because the number of stages is close to the optimum. Working from the output towards the input, we can compute the input capacitances of each gate based on the relation \(\hat{f} = g_i h_i = g_i C_{out}/C_{in}\):

| \(i\) | \(1\) | \(2\) | \(3\) | \(4\) | \(5\) | \(6\) | \(7\) |
|---|---|---|---|---|---|---|---|
| \(C_i\) | 3.45 | 6.29 | 6.06 | 6.83 | 7.27 | 9.0 | 20.1 |
| \(b_i\) | 1.64 | 2.32 | 1.58 | 2.1 | 1.44 | 1 | 1 |

Given the gate input capacitances, we can compute the branching efforts, and find that our assumption of branching effort \(b_i = 2\) is almost fulfilled, but only on average. The branching effort of our sized path is \(B' = 18.24\) rather than 32. Thus, given \(B' = 18.24\) the path effort is only \(F' = (4/3)^3 (5/3)^3 B' \cdot 36/6 = 1{,}201,\) and the stage effort for minimum path delay should be \(\hat{f} = F'^{1/7} = 2.75\) rather than the assumed value \(\hat{f} = 2.98.\) Nevertheless, our design is already pretty close to the minimum delay without further iterations. Table 4.2 summarizes the resulting delays.
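
The backward pass over the chain can be scripted. The sketch below (names are ours) assumes the off-path loads chosen above, minimum NAND gates of input capacitance 4 and reduced NOR gates of input capacitance 8, and recovers the capacitances \(C_i\):

```python
f_hat = 2.98
g = [1, 4/3, 5/3, 4/3, 5/3, 4/3, 5/3]   # stage gate efforts: INV, then NAND/NOR
off = [4, 8, 4, 8, 4, 0]                # off-path load at outputs of stages 1..6

C = [0.0] * 7        # C[i] is the input capacitance of stage i+1
load = 36.0          # load capacitance at output Y_6
for i in range(6, -1, -1):
    C[i] = g[i] * load / f_hat          # from f_hat = g_i * C_out / C_in
    if i > 0:
        load = C[i] + off[i - 1]        # next load: on-path plus off-path cap

assert abs(C[0] - 3.45) < 0.01          # matches C_1 in the sizing table
assert abs(C[6] - 20.1) < 0.05          # matches C_7
```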

| \(\mbox{stage}\) | \(C_{in}\) | \(C_{out}\) | \(h\) | \(g\) | \(f\) | \(p\) |
|---|---|---|---|---|---|---|
| 1 | 6 | 10.29 | 1.71 | 1 | 1.71 | 1 |
| 2 | 6.29 | 14.06 | 2.23 | 4/3 | 2.98 | 2 |
| 3 | 6.06 | 10.83 | 1.79 | 5/3 | 2.98 | 2 |
| 4 | 6.83 | 15.28 | 2.23 | 4/3 | 2.98 | 2 |
| 5 | 7.27 | 13.01 | 1.79 | 5/3 | 2.98 | 2 |
| 6 | 9.01 | 20.13 | 2.23 | 4/3 | 2.98 | 2 |
| 7 | 20.13 | 36 | 1.79 | 5/3 | 2.98 | 2 |
| total | | | | | 19.6 | 13 |

We find that gate sizing reduces the delay of the original priority encoder from \(35.3\) to \(32.6\) time units. This result confirms that our initial attempt was pretty fast already.

## 4.6. Decoder¶

A decoder computes the inverse transform of an encoder. In the
realm of digital circuits, the term *decoder* commonly refers to a
circuit that transforms a binary code into a *one-hot code*.

A binary to one-hot **decoder** is a circuit with \(n\) inputs
\(A_i,\) where \(0 \le i < n,\) and \(2^n\) outputs
\(Y_j,\) where \(0 \le j < 2^n,\) that asserts output
\(Y_k\) if the binary coded value of the input equals \(k.\)
For example, the truth table below specifies a 2:4 decoder with \(n=2\)
inputs and \(2^n = 4\) outputs, such that output \(Y_k = 1\)
if input \(A = k\) for \(0 \le k < 4.\)

| \(A_1\) | \(A_0\) | \(Y_3\) | \(Y_2\) | \(Y_1\) | \(Y_0\) |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 |
| 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 |
| 1 | 1 | 1 | 0 | 0 | 0 |

Figure 4.47 shows the implementation of the 2:4 decoder. Since each output represents one minterm, we use one 2-input AND gate per output. In general, for an \(n{:}2^n\) decoder, the circuit requires \(2^n\) AND gates with \(n\) inputs each. Each complemented and uncomplemented input drives \(2^{n-1}\) AND gates.
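
A behavioral model of the decoder is a one-liner in Python (function name is ours):

```python
def decode(a, n):
    """n:2^n binary to one-hot decoder: assert output Y_k for input value k = a."""
    return [1 if k == a else 0 for k in range(2 ** n)]

# 2:4 decoder: each output is the AND of one minterm of the two inputs.
# The returned list is ordered Y_0, Y_1, Y_2, Y_3.
assert decode(0, 2) == [1, 0, 0, 0]   # Y_0 = ~A1 & ~A0
assert decode(3, 2) == [0, 0, 0, 1]   # Y_3 =  A1 &  A0
```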

The primary use of decoders is as *address decoders* for
*memories*. For example, a register file with 8
registers employs a 3:8 decoder to select 1-out-of-8 registers. The
address input is 3 bits wide, with a binary code to cover address
range \([0, \ldots, 7].\) The 3:8 decoder transforms the binary
address into a one-hot code that asserts exactly 1 of its 8 outputs
associated with the selected register. A gate-level implementation of
a 3:8 decoder is shown in Figure 4.48 on the left. We use
2-forks to provide each input in complemented and uncomplemented
polarity. Since each input drives 4 NAND gates, the loads of the legs
are equal. Assuming also that the loads of the outputs \(Y\) are
equal, the decoder circuit is symmetric. Thus, we may minimize the
delay of the decoder by analyzing the path of interest, actually the
fork of interest, shown on the right-hand side of Figure 4.48. Each input has capacitance \(C_{in} = C_1 + C_2.\) The
load capacitance at each output shall be \(C_L = H C_{in}.\) Each
leg of the fork drives 4 NAND gates, one of which is on the path of
interest, whereas the others branch off the path of interest.

We use the *1d-analysis method* to minimize the
delay of the circuit. The delays allocated to the individual gates
are included in Figure 4.48. For the upper leg of the
fork, we multiply the effort delays to obtain:

because \(g_{nand3} = 5/3.\) For the lower leg, the partial products of the effort delays yield:

Since \(C_{in} = C_1 + C_2,\) we obtain the polynomial in \(d\):

Given electrical effort \(H,\) we can determine effort delay \(d(H)\) as the larger of two positive real roots of the polynomial. The minimum delay of the 3:8 decoder is then \(\hat{D} = 3 d(H) + P,\) where \(P = 6.\) Figure 4.49 plots the minimum delay for \(1 \le H \le 100.\)

Our 3:8 decoder is essentially a 3-stage design, which is best suited
for path effort \(F = 3.59^3 = 46.\) For larger decoders or
circuits with larger electrical effort \(H,\) we may increase the
number of stages either by appending inverters to the outputs or by
prepending inverters to the inputs. Furthermore, for large \(n\)
we may implement the \((n/2)\)-input NAND gates using a *tree
structure*.

## 4.7. Memory Elements¶

All of the logical circuits we have discussed so far are **acyclic**,
that is, they do not form cycles or loops except implicitly through the
power supply. Acyclic circuits have inputs \(A\) and outputs
\(Y,\) and implement logic functions of the inputs, \(Y =
f(A).\) When the inputs change, the outputs assume the values defined
by \(f\) after a circuit specific propagation delay. Cyclic
circuits exhibit a more complex, time dependent behavior than their
acyclic cousins, because their outputs depend on the input sequence
including current and past input values. Cyclic circuits are of
particular interest for the implementation of memory elements. In
this section, we discuss two of the most commonly used types of memory
elements, the *D-latch* and the *D-flipflop*.

### 4.7.1. Cyclic Circuits¶

A **cyclic circuit** has one or more *loops*, not counting
implicit power supply loops. For example, Figure 4.50
shows two cyclic circuits with one loop each.

Cyclic circuits are more difficult to analyze than acyclic circuits. In particular, the circuits shown in Figure 4.50 do not even have distinguished input and output terminals. The inverter pair on the left of Figure 4.50 has two wires: \(Q_0\) connects the output of inverter 0 to the input of inverter 1, and \(Q_1\) connects the output of inverter 1 to the input of inverter 0. Analogously, the three-inverter cycle on the right has three wires, \(Q_0,\) \(Q_1,\) and \(Q_2.\)

Let us analyze the functionality of the two-inverter loop. The output of each inverter can be 0 or 1. First, assume that \(Q_0 = 0.\) Thus, the input of inverter 1 is 0, hence \(Q_1 = 1.\) Since \(Q_1\) is the input of inverter 0, the output of inverter 0 must be \(Q_0 = 0.\) This is the same value that we started with. We conclude that the circuit reinforces \(Q_0 = 0\) and, by symmetry, \(Q_1 = 1.\) Second, assume that \(Q_0 = 1.\) Then the output of inverter 1 is \(Q_1 = 0,\) which causes the output of inverter 0 to be \(Q_0 = 1.\) This is also the same value we started out with. We observe that the two-inverter loop reinforces its state in either case \(Q_0 = 0\) or \(Q_0 = 1.\)

The three-inverter loop behaves very differently than the two-inverter loop. Assume that \(Q_0 = 0.\) Since \(Q_0\) is the input of inverter 1, the output of inverter 1 must be \(Q_1 = 1.\) Since \(Q_1\) is the input of inverter 2, its output must be \(Q_2 = 0.\) Now, \(Q_2\) is the input of inverter 0. Therefore, the output of inverter 0 must be \(Q_0 = 1,\) which is the complement of the assumption we started with. If we traverse the loop a second time, we find \(Q_0 = 0,\) that is, the wire is complemented again. We observe that the circuit does not reinforce its state. Instead, the values of the wires toggle at a speed determined by the propagation delay of the inverters.
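
We can mimic both loops with a synchronous update rule in Python (a crude abstraction of the real continuous-time behavior; names are ours):

```python
def step(q):
    """One update of an inverter loop: wire i is driven by inverter i,
    whose input is wire i-1 (mod n)."""
    return [1 - q[i - 1] for i in range(len(q))]   # q[-1] wraps around

# Even loop (2 inverters): the state {Q0 = 0, Q1 = 1} reinforces itself.
assert step([0, 1]) == [0, 1]

# Odd loop (3 inverters): the state toggles and never settles.
q = [0, 1, 0]
assert step(q) != q
assert step(step(q)) != q
```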

We conclude that the two-inverter loop is stable whereas the
three-inverter loop is not. In fact, these behaviors can be observed
in larger loops as well. Every inverter loop with an even number of
inverters is stable. In contrast, inverter loops with an odd number of
inverters are unstable. Odd-numbered inverter loops are called **ring
oscillators**, and are useful to measure the average propagation delay
of an inverter in a given manufacturing process. Ring oscillators are
not used as logic elements in digital circuits. In contrast, loops
with even numbers of inverters are useful building blocks, because
they are *bistable*. A **bistable circuit** is a circuit with two
stable states. A state is **stable**, if the circuit does not
transition into another state without external stimulus. The two
stable states of the bistable two-inverter loop are \(S_0 =
\{Q_0=0, Q_1=1\}\) and \(S_1 = \{Q_0 = 1, Q_1 = 0\}.\) When the
circuit assumes a stable state, the inverters enforce that \(Q_0\)
and \(Q_1\) are complements of each other, \(Q_0 =
\overline{Q_1}.\) Thus, in Figure 4.51, we call
the wires simply \(Q\) and \(\overline{Q}.\)

The problem with the bistable inverter loop in Figure 4.51 is that we cannot control its state. Since the circuit
is stable and has no inputs, it is not obvious how to transition the
circuit from one stable state into the other. The *D-latch*, that we
discuss below, fixes this deficiency.

### 4.7.2. D-Latch¶

A **D-latch** is a bistable memory element with data input \(D,\)
clock input \(\phi,\) and output \(Q.\) The D-latch symbol is
shown on the right. Figure 4.52 shows a D-latch
implementation with a bistable inverter loop and a 2:1 multiplexer to
steer the data input signal into the loop.

Clock input \(\phi\) of the D-latch serves as select input of the multiplexer, and determines whether data input \(D\) asserts the state of the bistable inverter loop. Hence, we distinguish two modes of operation:

\(\phi = 1\): *D-latch is transparent.*

The inverter loop is open, and input \(D\) drives the inverter pair. Since output \(Q\) follows input \(D,\) we say the D-latch is transparent.

\(\phi = 0\): *D-latch is opaque.*

The inverter loop is closed, and stores the current state. Since output \(Q\) retains its value, independent of input \(D,\) we say the D-latch is opaque.

Figure 4.53 shows the D-latch circuit with the multiplexer replaced by its switch model. The mode of operation depends on the position of the multiplexer switch. If \(\phi = 1,\) the loop is open, and input \(D\) drives both output \(Q\) and the inverter pair. Otherwise, if \(\phi = 0,\) the closed loop is disconnected from input \(D,\) and retains value \(Q\) because the loop is a bistable cyclic circuit.
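
The two modes of operation translate directly into a behavioral model. The Python sketch below (class name is ours) captures the transparent and opaque modes, ignoring propagation delays:

```python
class DLatch:
    """Level-sensitive D-latch: transparent while phi = 1, opaque while phi = 0."""
    def __init__(self):
        self.q = 0               # state held by the inverter loop

    def eval(self, phi, d):
        if phi == 1:
            self.q = d           # transparent: Q follows D
        return self.q            # opaque: Q retains the stored value

latch = DLatch()
assert latch.eval(1, 1) == 1     # transparent, Q follows D
assert latch.eval(0, 0) == 1     # opaque, Q holds the last value of D
assert latch.eval(0, 1) == 1     # input changes are ignored while opaque
```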

The waveform diagram in Figure 4.54 illustrates the
operation of a D-latch over time. The diagram shows the voltage
levels of the clock and data inputs and the D-latch output.
Initially, the D-latch is opaque and stores output value \(Q =
0.\) Input \(D\) transitions to 1 before the clock signal. When
the D-latch becomes transparent, the output follows input \(D\)
after a propagation delay, indicated by the curved arrow from
\(\phi\) to \(Q.\) The D-latch remains transparent as long as
the clock signal is \(\phi = 1.\) Output \(Q\) follows input
\(D\) after a propagation delay, indicated by the curved arrows
from the transitions of \(D\) to \(Q.\) When the clock
transitions to \(\phi = 0,\) the D-latch stores the last value of
input \(D\) before the clock transition, and holds this value for
as long as the D-latch remains opaque. The D-latch is called
**level-sensitive** because its two modes of operation depend on the
level of the clock signal.

In the following, we derive a CMOS circuit for the D-latch in
Figure 4.52. We begin with the implementation of the
*2:1 multiplexer* with two select arms, integrating
the \(\overline{\phi}\)-arm into the inverter loop as shown in
Figure 4.55. Since this multiplexer circuit
inverts the output, we add an inverter to generate uncomplemented
output \(Q.\) Figure 4.55 shows the
multiplexer select arms as *tristate inverters* in form of an inverter that drives a transmission
gate.

The functionality of the D-latch circuit depends essentially on the state of internal node \(X.\) When the \(\phi\)-arm is closed, it steers input \(D\) to \(X.\) Since the inverting mux arm changes the polarity, we have \(X = \overline{D}.\) The output inverter drives \(X\) to output \(Q,\) so that \(Q = \overline{X} = D\) when clock input \(\phi = 1,\) and the D-latch is transparent. Otherwise, when \(\phi = 0,\) the \(\phi\)-arm is open and the \(\overline{\phi}\)-arm is closed, then node \(X\) is disconnected from input \(D\) and the inverter loop reinforces \(X = \overline{Q}.\) The D-latch is opaque and output \(Q\) retains its value.

A compact CMOS implementation of the mux arms is shown in Figure 4.56. We exploit the circuit equivalence in Figure 4.39 to implement each mux arm with four transistors.

Figure 4.57 shows the complete 12-transistor CMOS circuit for the D-latch, including an interactive switch model. Note that the output of the feedback inverter is \(Y = \overline{X},\) and the circuit also enforces \(X = \overline{Q}\) and \(Y = Q.\)

Figure 4.57: CMOS circuit for D-latch and interactive switch model.

The timing behavior of the D-latch depends on the mode of operation. When the D-latch is transparent, the critical path stretches from data input \(D\) to output \(Q.\) The signal propagates through internal node \(X,\) and bypasses the feedback loop entirely. The delay is independent of clock signal \(\phi.\) When the D-latch is opaque, the output remains unchanged, so a propagation delay is not meaningful in this mode. Therefore, we discuss the propagation delay of the transparent D-latch. In particular, we analyze the critical path of the D-latch with the matched transistor sizes shown in Figure 4.58.

The propagation delay of the transparent D-latch is the delay of the 2:1 multiplexer plus the delay of the output inverter: \(d_{D\rightarrow Q} = d_{mux} + d_{inv}.\) Assuming that output \(Q\) drives capacitive load \(C_L,\) the output inverter has electrical effort \(h_{inv} = C_L/C_{inv}\) and parasitic delay \(p_{inv}=1,\) such that \(d_{inv} = C_L/3 + 1\) time units. Next, we analyze the multiplexer. The logical effort is the input capacitance \(C_{mux}(D)\) of input \(D\) divided by the input capacitance of a reference inverter \(C_{inv}=3.\) Input \(D\) drives the pMOS transistor of width 4 and the nMOS transistor of width 2 of the \(\phi\)-arm. Thus, \(C_{mux}(D) = 6,\) and the logical effort of input \(D\) of the mux is \(g_{mux}(D) = 6/3 = 2.\) To determine the electrical effort of the mux, notice that output \(X\) of the mux drives two inverters, the feedback inverter and the output inverter. Thus, the load capacitance of the mux is \(C_L(mux) = 2 C_{inv} = 6.\) Therefore, the electrical effort of input \(D\) of the mux is \(h_{mux}(D) = C_L(mux)/C_{mux}(D) = 6/6 = 1.\) The parasitic delay of the 2:1 mux is \(p_{mux} = 4,\) see multiplexer. We find that the delay of the multiplexer is \(d_{mux} = g_{mux}(D) h_{mux}(D) + p_{mux} = 2 \cdot 1 + 4 = 6,\) and the propagation delay of the D-latch amounts to

\[d_{D\rightarrow Q} = d_{mux} + d_{inv} = 6 + \Bigl(\frac{C_L}{3} + 1\Bigr) = 7 + \frac{C_L}{3}\ \text{time units}.\]

This delay may serve as a point of reference for circuit optimizations. For example, we observe that the feedback inverter diverts current from the multiplexer output. If we shrink the pMOS transistor of the feedback inverter from 2 to 1 units of normalized width, we reduce the capacitance of the off-path branch and the delay of the path of interest.
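The delay arithmetic above is easy to check programmatically. The following sketch is ours, not part of the text: the function name `stage_delay` is an assumption, and the effort values come directly from the derivation above, with the load \(C_L = 12\) used in the subsequent analysis.

```python
def stage_delay(g, h, p):
    """Single-stage delay by the method of logical effort:
    effort delay g*h plus parasitic delay p (in units of tau)."""
    return g * h + p

# 2:1 multiplexer, input D: g = C_mux(D)/C_inv = 6/3 = 2,
# h = C_L(mux)/C_mux(D) = 6/6 = 1, parasitic delay p = 4.
d_mux = stage_delay(g=2, h=1, p=4)

# Output inverter driving load C_L = 12: g = 1, h = C_L/C_inv = 12/3, p = 1.
d_inv = stage_delay(g=1, h=12 / 3, p=1)

# Propagation delay of the transparent D-latch.
d_DQ = d_mux + d_inv
print(d_mux, d_inv, d_DQ)  # 6 5.0 11.0
```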

Propagation delay \(d_{D\rightarrow Q}\) is only one of several characteristic quantities of the timing behavior of the D-latch. Arguably even more important is the timing behavior of the D-latch when the clock input changes. In particular, if both the clock and data inputs change at about the same time, the D-latch may become unstable. Digital circuits with D-latches must avoid this scenario, because it can adversely affect their functionality. Functional bugs caused by careless timing design are particularly difficult to uncover. We discuss the timing behavior of the D-latch in Figure 4.59, where data input \(D\) transitions at time \(t = 0\) and clock input \(\phi\) at time \(t = 14.\)

Given the D-latch transistor sizes in Figure 4.58 and a load capacitance of \(C_L = 12,\) the delays of the circuit elements of the D-latch are:

\(d_{D\rightarrow X}\): \(\quad \phi\)-arm delay; \(d_{D\rightarrow X} = d_{mux} = 6\)

\(d_{X\rightarrow Y}\): \(\quad\) feedback inverter delay; \(d_{X\rightarrow Y} = 6/3 + 1 = 3\)

\(d_{Y\rightarrow X}\): \(\quad \overline{\phi}\)-arm delay; \(d_{Y\rightarrow X} = d_{mux} = 6\)

\(d_{X\rightarrow Q}\): \(\quad\) output inverter delay; \(d_{X\rightarrow Q} = d_{inv} = C_L/3 + 1 = 5\)

With these element delays, we can express the propagation delay of the transparent D-latch as \(d_{D\rightarrow Q} = d_{D\rightarrow X} + d_{X \rightarrow Q} = 11.\) The waveform diagram shows the corresponding transitions. Initially, the D-latch is transparent because \(\phi = 1,\) and the \(\phi\)-arm is closed whereas the \(\overline{\phi}\)-arm is open. At time \(t=0,\) the \(D\)-input changes from 0 to 1. Output \(Q\) will follow \(D\) after the input has propagated through the \(\phi\)-arm to internal node \(X\) at \(t = 6,\) and then through the output inverter at time \(t = 11.\) The change of internal node \(X\) also affects node \(Y,\) which changes at time \(t = 9\) from 0 to 1 after propagation delay \(d_{X\rightarrow Y}\) of the feedback inverter.
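The transition times in this walk-through follow mechanically from the element delays. The sketch below tabulates them; the variable names are ours:

```python
# Element delays from the analysis above (time units).
d_D_to_X = 6   # phi-arm (multiplexer) delay
d_X_to_Y = 3   # feedback inverter delay
d_X_to_Q = 5   # output inverter delay, for C_L = 12

t_D = 0                   # input D transitions at t = 0
t_X = t_D + d_D_to_X      # internal node X follows at t = 6
t_Y = t_X + d_X_to_Y      # feedback node Y follows at t = 9
t_Q = t_X + d_X_to_Q      # output Q follows at t = 11
print(t_X, t_Y, t_Q)      # 6 9 11
```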

When the D-latch becomes opaque at time \(t = 14,\) it stores value \(Q = 1.\) Clock signal \(\phi = 0\) closes, or turns on, the \(\overline{\phi}\)-arm and opens, or turns off, the \(\phi\)-arm. Thus, after the clock transition at \(t=14,\) signal \(Y=1\) takes propagation delay \(d_{Y\rightarrow X} = 6\) time units through the inverting \(\overline{\phi}\)-arm to reinforce internal node \(X=0\) at time \(t = 20.\) It is this switching delay of the multiplexer that can cause trouble. In particular, the transition of input \(D\) must occur sufficiently long before the negative transition of clock \(\phi\) to stabilize internal node \(Y\) through the feedback inverter, because the feedback loop is bistable only if \(X = \overline{Y}.\) If we force \(X = Y,\) the feedback loop assumes an unpredictable state. This can happen if the time interval \(d_{D\rightarrow \phi}\) between the transitions of inputs \(D\) and \(\phi\) is too small. Figure 4.60 illustrates the timing problems of a D-latch.

In the waveform diagram on the left of Figure 4.60, the interval between the transition of input \(D\) and clock \(\phi\) is \(d_{D\rightarrow \phi} = d_{D\rightarrow X} + d_{X\rightarrow Y}.\) At time \(t = 9,\) the negative clock edge begins to close the \(\overline{\phi}\)-arm and to open the \(\phi\)-arm. Internal node \(Y\) follows \(D\) just in time at \(t = 9\) to reinforce internal node \(X\) after the multiplexer delay of \(d_{Y\rightarrow X} = 6\) time units, so that \(X = 0\) for \(t > 15.\) Time interval \(d_{D\rightarrow \phi} = d_{D\rightarrow X} + d_{X\rightarrow Y}\) is the smallest interval for the D-latch to capture input \(D\) safely.

The waveform diagram in the middle of Figure 4.60 assumes that interval \(d_{D\rightarrow \phi}\) is
smaller, that is \(d_{D\rightarrow X} < d_{D\rightarrow \phi} <
d_{D\rightarrow X} + d_{X\rightarrow Y}.\) After the negative clock
edge at time \(t = 8,\) the feedback loop enters unstable state
\(X = Y = 0.\) After the multiplexer delay of
\(d_{Y\rightarrow X} = 6\) time units, at time \(t = 14,\) the
inverting \(\overline{\phi}\)-arm drives value \(\overline{Y}
= 1\) onto inner node \(X.\) However, since the feedback inverter
produces output value \(Y = 1\) just 1 time unit after the
negative clock edge, at time \(t = 9,\) the
\(\overline{\phi}\)-arm pulls inner node \(X\) back to 0 at
time \(t = 15.\) The resulting *glitch* on node \(X\)
propagates through the feedback and output inverters. Since inverters
attenuate such glitches, the feedback loop stabilizes and recovers
\(X = 0.\) However, the glitch appears at output \(Q\) after
propagation delay \(d_{glitch} = d_{D\rightarrow \phi} +
d_{Y\rightarrow X} + d_{X\rightarrow Q},\) which is larger than
propagation delay \(d_{D\rightarrow X} + d_{X\rightarrow Q}\) of
the transparent D-latch.

The waveform diagram on the right in Figure 4.60 assumes an even smaller interval \(d_{D\rightarrow \phi} < d_{D\rightarrow X}.\) The negative clock edge opens the \(\phi\)-arm too early for input \(D\) to pull inner node \(X\) to 0. The opening \(\phi\)-arm fights the closing \(\overline{\phi}\)-arm, which succeeds in pulling \(X\) to 1. As a consequence, the D-latch fails to capture input \(D=1\) and continues to store the old value \(Q = D = 0.\) The output inverter propagates the glitch on node \(X\) to output \(Q\) before restoring the old output value \(Q = 0.\)

Figure 4.61 summarizes the timing behavior of the D-latch in a graph that plots interval \(d_{D\rightarrow \phi}\) on the horizontal axis and propagation delay \(d_{D\rightarrow Q}\) on the vertical axis. Case (1) corresponds to the scenario in Figure 4.59, where the transition of input \(D\) occurs sufficiently early before the negative clock edge for the D-latch to propagate the change safely to output \(Q.\) The propagation delay is the sum of the delays of the \(\phi\)-arm and the output inverter, \(d_{D\rightarrow Q} = d_{D\rightarrow X} + d_{X\rightarrow Q}.\) Scenarios (2), (3), and (4) correspond to the three cases illustrated in Figure 4.60. The closer the transition of input \(D\) gets to the negative clock edge, the larger propagation delay \(d_{D\rightarrow Q}\) becomes. When interval \(d_{D\rightarrow \phi}\) becomes too small, the D-latch fails to capture the new input signal altogether, and retains the old input value. We observe that for safe operation of the D-latch, we must guarantee that input \(D\) changes sufficiently early before the negative clock edge, i.e., interval \(d_{D\rightarrow \phi}\) must be large enough.

The **setup time** \(t_{setup}\) is the minimum time input \(D\) must be stable before the negative clock edge to capture the input value within a reasonable delay \(d_{D\rightarrow Q}.\)

For example, reasonable specifications define \(t_{setup}\) to be \(5\,\%\) larger than the propagation delay of the transparent D-latch, \(t_{setup} = 1.05\,(d_{D\rightarrow X} + d_{X\rightarrow Q}).\)
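For the example latch, the 5 % rule yields a setup time that comfortably exceeds the minimum safe interval \(d_{D\rightarrow X} + d_{X\rightarrow Y}\) derived above. A quick check, with the numbers from the text and names of our own choosing:

```python
d_D_to_X = 6    # phi-arm delay
d_X_to_Y = 3    # feedback inverter delay
d_X_to_Q = 5    # output inverter delay, for C_L = 12

# Smallest interval between D and the clock edge that still captures D safely.
d_min = d_D_to_X + d_X_to_Y            # 9 time units

# Setup time under the 5 % rule of thumb.
t_setup = 1.05 * (d_D_to_X + d_X_to_Q)

assert t_setup > d_min                 # margin covers the minimum safe interval
print(round(t_setup, 2))               # 11.55
```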

Timing problems can also occur when the transition of input \(D\) happens after the negative clock edge. If input \(D\) transitions before the \(\phi\)-arm is completely open, glitches can propagate to output \(Q.\) Figure 4.62 illustrates delay \(d_{D\rightarrow Q}\) as a function of interval \(d_{D\rightarrow \phi}.\) Analogous to the setup time, for glitch-free operation of the D-latch, we must ensure that input \(D\) does not transition until a sufficient time after the negative clock edge has passed.

The **hold time** \(t_{hold}\) is the minimum time input \(D\) must be stable after the negative clock edge to capture the input value within a reasonable delay \(d_{D\rightarrow Q}.\)

Setup time and hold time characterize the timing behavior of the D-latch around the negative clock edge. Most manufacturers offer D-latches as basic circuit elements, and supply their process-specific setup and hold times in their datasheets. For the circuit designer, it is important to ensure that the data input does not change within the critical interval around the negative clock edge of the D-latch. This is the reason why the clock input of a D-latch is commonly connected to the regular beat of a clock signal with a well-defined clock period. Such a clock signal restricts the design choices but gives the designer a clean time reference for the permitted delay range of the circuit that drives the data input of a D-latch.

### 4.7.3. D-Flipflop¶

A **D-flipflop** is a bistable memory element with data input
\(D,\) clock input \(\phi,\) and output \(Q.\) A
D-flipflop is an **edge-triggered** memory element that is activated
by a clock edge. At the triggering clock edge, the D-flipflop stores
input \(D\) until the next triggering clock edge occurs. This
behavior is different from a D-latch which is *level-sensitive*.
Whereas a D-latch is opaque when the clock input level is low, the
D-flipflop is always opaque except for the short time period around
the triggering clock edge. Nevertheless, the D-flipflop can be built
by a series composition of two D-latches with complemented clock
inputs, as shown in Figure 4.63. The
D-flipflop symbol on the right has a triangle at the clock input to
indicate that the flipflop is edge triggered.

The first D-latch with input \(D\) and output \(Q1\) is called
the **master** latch, and is *negative-sensitive*, because it is
transparent when clock input \(\phi = 0.\) The second D-latch
with input \(Q1\) and output \(Q\) is the **slave** latch, and
is *positive-sensitive*, because it is transparent when clock input
\(\phi = 1.\) The D-flipflop is **positive-edge triggered**
because it stores input \(D\) at the rising edge of the clock. If
we invert the clock input, the flipflop would be negative-edge
triggered, and store input \(D\) at the falling edge of the clock.
Figure 4.64 shows a CMOS circuit for the positive-edge
triggered D-flipflop. The circuit saves four transistors by removing
the output inverter of the master latch and the input inverter of the
slave latch.

To understand the functionality of the D-flipflop, consider the switch model and the associated waveform diagrams in Figure 4.65 and Figure 4.66. Figure 4.65 illustrates the operation during the negative half-cycle of the clock. The master latch is transparent, and input \(D\) can propagate to inner node \(Q1.\) Since the slave latch is opaque, input \(D\) cannot propagate beyond \(Q1\) to output \(Q.\) Instead, the slave latch reinforces output value \(Q.\)

During the positive half-cycle of the clock, illustrated in Figure 4.66, the master latch is opaque, and disconnects input \(D\) from inner node \(Q1.\) Thus, the master latch stores the last value of input \(D\) before the positive clock edge. The slave latch is transparent, and propagates the value of \(Q1\) to output \(Q,\) which is the last value of input \(D\) before the positive clock edge.

Note that output \(Q\) remains unchanged after the negative clock edge: during the negative half-cycle, the opaque slave latch stores the value that the master latch captured during the preceding positive half-cycle of the clock.

The triggering clock edge is the timing-critical transition of the D-flipflop. In the case of the positive-edge triggered D-flipflop, the rising edge of clock \(\phi\) closes the negative-sensitive master latch, and thus plays the role of the negative clock edge in our D-latch timing analysis. This is the critical clock edge, where the setup time and hold time of the master D-latch must be observed. Thus, the D-flipflop is subject to the same timing constraints as a D-latch: input signal \(D\) must be stable during the setup time before the positive clock edge and during the hold time after the positive clock edge.

The D-flipflop is also called a **register**, and is one of the most
widely used memory elements in digital circuit design. A single
D-flipflop implements a 1-bit register. An \(n\)-bit register
consists of \(n\) D-flipflops triggered by the same clock signal.
Each of the \(n\) D-flipflops has an independent D-input and
Q-output. The register symbol is shown on the right.

A D-latch is transparent when clock \(\phi = 1\) and opaque when \(\phi = 0.\) A positive-edge triggered D-flipflop stores input \(D\) at the rising edge of clock signal \(\phi.\) Complete the waveform diagram below with D-latch output \(Q_\text{latch}\) and D-flipflop output \(Q_\text{ff},\) assuming both memory elements are connected to clock \(\phi\) and input signal \(D.\)

D-latch output \(Q_\text{latch}\) follows input \(D\) while clock \(\phi = 1.\) At the negative clock edge, when the clock transitions from 1 to 0, the D-latch retains the value input \(D\) has at the transition.

D-flipflop output \(Q_\text{ff}\) changes at the positive clock edges only, i.e. where the clock signal transitions from 0 to 1. There, it stores the value of input \(D,\) and retains this value for the entire clock cycle until the next positive clock edge occurs.
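The behavioral contrast between the two memory elements can be captured in a few lines of Python. This is an idealized zero-delay sketch of our own, ignoring propagation delays and setup/hold constraints: the latch output tracks \(D\) whenever \(\phi = 1,\) while the flipflop samples \(D\) only at rising edges of \(\phi.\)

```python
def simulate(phi, d):
    """Return (Q_latch, Q_ff) sequences for sampled clock phi and data d.
    Both memory elements start with stored value 0."""
    q_latch = q_ff = 0
    latch_out, ff_out = [], []
    prev_phi = 0
    for p, x in zip(phi, d):
        if p == 1 and prev_phi == 0:  # rising clock edge: flipflop samples D
            q_ff = x
        if p == 1:                    # latch is transparent while phi = 1
            q_latch = x
        latch_out.append(q_latch)
        ff_out.append(q_ff)
        prev_phi = p
    return latch_out, ff_out

phi = [0, 1, 1, 0, 0, 1, 1, 0]
d   = [1, 1, 0, 0, 1, 0, 1, 1]
q_latch, q_ff = simulate(phi, d)
print(q_latch)  # latch follows D while phi = 1, holds at the negative edge
print(q_ff)     # flipflop changes only at positive clock edges
```

The two printed sequences differ exactly where the waveform exercise above differs: the latch output changes mid-cycle while the clock is high, whereas the flipflop output is constant over each full clock cycle.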

## 4.8. Wires¶

In circuit diagrams we draw wires to signify the connectivity between terminals of transistors, gates, and larger circuits. Real wires, however, exhibit a more complex timing behavior than meets the eye. We distinguish three wire models:

**Ideal wire**: The wire has negligible resistance and capacitance, which we can ignore for the purpose of delay analysis. In practice, this simple model is relevant for very short wires only.

**Capacitive wire**: The wire has negligible resistance but significant capacitance. This model applies to wires of medium length. For delay analysis, we can model the wire capacitance as a lumped capacitance in parallel to the load capacitance of the gate driving the wire. The method of logical effort captures stray capacitances due to wires without changes.

**RC wire**: The wire has significant resistance and capacitance. Long wires have significant delay, which can be modeled as resistances and capacitances distributed along the extent of the wire. In state-of-the-art VLSI chips, long wires have delays spanning tens of clock cycles, and have even become performance limiting. Modeling long wires is beyond the capabilities of the method of logical effort.

In this section, we introduce the *Elmore delay* as an alternative
model for the timing analysis of *RC* circuits, and show that we can
reduce the signal propagation delay of long wires from a quadratic
function in the wire length to a linear function by using repeaters.

### 4.8.1. Elmore Delay¶

When a wire is so long that it incurs a delay comparable to or even
larger than the delay of its driver, we need to account for the wire
delay if we wish to minimize the delay of the whole circuit.
Figure 4.67 illustrates the situation with a long wire
driven by an inverter and load capacitance \(C_L.\) We can model
the resistance and capacitance of the wire by considering an
infinitesimally short segment. Then, the methods of calculus apply,
and the model for the wire delay results in the famous *diffusion
equation*, which does not have analytic solutions for the problem at
hand. We could solve the diffusion equation numerically, but a more
insightful model can be derived from an approximation of the wire by
means of finite resistances and capacitances, lumped into discrete
resistors and capacitors, as shown in Figure 4.67.

The **Elmore delay** approximates the delay of an *RC* wire with
\(n\) segments very accurately as the sum of time constants:

\[t_{RC} = \sum_{i=1}^{n} \Bigl(\sum_{j=1}^{i} R_j\Bigr)\, C_i.\]

Figure 4.68 illustrates the indexing used in this sum for
\(1 \le i \le 4.\) Each index corresponds to a branching node
with an off-path capacitance. If we index the resistors and
capacitors as shown in Figure 4.68, we can interpret the
Elmore delay as the sum of the \(RC\) constants induced by the
currents shown in the *RC* model. The total drive current flowing
from the inverter into the wire is \(i_1 + i_2 + i_3 + i_4.\) By
KCL, \(i_1\) charges capacitor \(C_1,\) whereas current
\(i_2 + i_3 + i_4\) flows into the remaining wire segments. The
*RC* constant of the first segment is \(R_1 C_1.\) Current
\(i_2\) charges capacitor \(C_2.\) Since \(i_2\) flows
through \(R_1\) and \(R_2,\) the time constant for the second
segment is \((R_1 + R_2) C_2.\) For the whole wire model in
Figure 4.68, the Elmore delay amounts to:

\[t_{RC} = R_1 C_1 + (R_1 + R_2)\, C_2 + (R_1 + R_2 + R_3)\, C_3 + (R_1 + R_2 + R_3 + R_4)\, C_4.\]

The Elmore delay enables us to derive the delay of a wire as a function of its length. Assume that a wire segment of length \(\Delta L\) has lumped resistance \(R\) and capacitance \(C.\) In Figure 4.68, we assume that \(R_i = R\) and \(C_i = C\) for all nodes \(i.\) Then, the Elmore delay of a wire with \(n\) segments and length \(L = n \Delta L\) is:

\[t_{RC}(L) = \sum_{i=1}^{n} (i\,R)\, C = \frac{n(n+1)}{2}\, R\, C.\]

We conclude that the Elmore delay is proportional to the square of the number of wire segments \(n\) and is, hence, proportional to the square of the wire length, \(t_{RC}(L) \propto L^2.\) Intuitively, this result is obvious considering (1) that the total resistance of a wire \(R_{wire} = n R\) grows linearly in \(L,\) and (2) the total capacitance \(C_{wire} = n C\) grows linearly in \(L\) viewing the wire as one plate of a capacitor and the rest of the chip as the other. Therefore, product \(RC\) grows quadratically in \(L.\)
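The quadratic growth is easy to verify numerically. The helper below is our own sketch of the Elmore sum; for a uniform wire with \(R = C = 1\) per segment it reproduces \(n(n+1)/2\):

```python
def elmore_delay(R, C):
    """Elmore delay of an RC ladder: each capacitor C[i] is charged
    through the total resistance between the driver and node i."""
    total = r_path = 0
    for r, c in zip(R, C):
        r_path += r          # resistance accumulated from the driver
        total += r_path * c  # time constant contributed by this node
    return total

# Two-segment check against the derivation: R1*C1 + (R1 + R2)*C2.
assert elmore_delay([2, 3], [5, 7]) == 2 * 5 + (2 + 3) * 7

# Uniform wire with R = C = 1 per segment: delay = n*(n+1)/2.
for n in (4, 8, 16):
    print(n, elmore_delay([1] * n, [1] * n))  # 10, 36, 136: quadratic in n
```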

### 4.8.2. Repeater Insertion¶

If we want to reduce the delay of a long wire, increasing the size of
the driver alone is not an effective method. However, if we use
**repeaters** such as inverters or buffers at carefully chosen
locations distributed across the wire, we can reduce the wire delay
significantly. Specifically, we divide a wire of length \(L\)
into \(n\) segments of length \(\Delta L = L/n,\) and insert a
repeater at the beginning of each segment, as illustrated in
Figure 4.69. Using inverters as repeaters minimizes the
additional delay introduced by the repeaters themselves.
Nevertheless, to reduce the overall delay, we need to strike the
proper balance in the number of repeaters. Too few repeaters, and the
wire delay remains quadratic in wire length. Too many repeaters, and
the wire delay is dominated by the delay of the inverter chain.

To find the optimal number of repeaters, consider the *RC* model of a
single wire segment in Figure 4.70. We assume that the wire of
length \(L\) with total resistance \(R_{wire}\) and
capacitance \(C_{wire}\) has \(n\) segments of length
\(\Delta L = L/n,\) such that each wire segment has resistance
\(R_{wire}/n\) and capacitance \(C_{wire}/n.\) The *RC* model
of the repeating inverter shall have on-resistance \(R_{inv}\) and
output capacitance \(C_{inv}.\) Assuming that all inverters have
the same size, the load capacitance of the wire segment is the input
capacitance of the inverter, \(C_{inv}.\)

The Elmore delay of one wire segment of length \(\Delta L,\) including the repeating inverter, is

\[t_{seg} = \Bigl(R_{inv} + \frac{R_{wire}}{n}\Bigr) \Bigl(\frac{C_{wire}}{n} + C_{inv}\Bigr).\]

Therefore, the Elmore delay of the whole repeated wire of length \(L\) is

\[t_{RC}(L) = n \Bigl(R_{inv} + \frac{R_{wire}}{n}\Bigr) \Bigl(\frac{C_{wire}}{n} + C_{inv}\Bigr) = R_{inv} C_{wire} + R_{wire} C_{inv} + n\, R_{inv} C_{inv} + \frac{R_{wire} C_{wire}}{n}.\]

Since the optimal number of repeaters \(n\) minimizes the wire delay \(t_{RC}(L),\) we set the derivative to zero:

\[\frac{d\, t_{RC}}{d n} = R_{inv} C_{inv} - \frac{R_{wire} C_{wire}}{n^2} = 0,\]

and find the optimal number of repeaters

\[n = \sqrt{\frac{R_{wire}\, C_{wire}}{R_{inv}\, C_{inv}}}.\]

With this value of \(n,\) and utilizing the observation that the total resistance and capacitance of the wire are proportional to its length \(L,\) i.e. \(R_{wire} \propto L\) and \(C_{wire} \propto L,\) we find that the minimum delay of the repeated wire is proportional to \(L\):

\[t_{RC}(L) = R_{inv} C_{wire} + R_{wire} C_{inv} + 2 \sqrt{R_{inv} C_{inv}\, R_{wire} C_{wire}}\ \propto\ L.\]

Here, we assume that the inverter size and, hence, \(R_{inv}\) and \(C_{inv}\) are constants, independent of wire length \(L.\) We conclude that the delay of long wires does not need to be proportional to the square of their length but, at the expense of inserting repeaters, can be reduced to be directly proportional to their length.
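The optimization above can also be checked numerically. The constants below are invented for illustration and are not from the text; the segment delay follows a lumped model in which each repeater drives its wire segment plus the input capacitance of the next repeater:

```python
import math

R_INV, C_INV = 10.0, 1.0     # repeater on-resistance and capacitance (assumed)
R_PER_L, C_PER_L = 2.0, 0.5  # wire resistance/capacitance per unit length (assumed)

def repeated_wire_delay(L, n):
    """Elmore delay of a wire of length L split into n repeated segments."""
    r_w, c_w = R_PER_L * L, C_PER_L * L
    return n * (R_INV + r_w / n) * (c_w / n + C_INV)

def optimal_repeaters(L):
    """n that minimizes the delay: sqrt(R_wire*C_wire / (R_inv*C_inv))."""
    r_w, c_w = R_PER_L * L, C_PER_L * L
    return math.sqrt(r_w * c_w / (R_INV * C_INV))

for L in (100, 200, 400):
    n = max(1, round(optimal_repeaters(L)))
    print(L, n, round(repeated_wire_delay(L, n), 1))
# Doubling L roughly doubles the minimized delay, whereas without
# repeaters (n = 1) doubling L roughly quadruples it.
```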

Footnotes

[1] The number of inputs of a gate is also called fan-in. For instance, a NAND gate with \(n\) inputs has a fan-in of \(n.\)