4. Basic Digital Circuits¶

Digital circuits are compositions of logic gates. A small number of digital circuits occur frequently in larger digital designs, such as multiplexers, encoders, decoders, and memory elements. In this chapter, we use the method of logical effort to study those basic digital circuits that serve as building blocks for the construction of larger digital systems.

4.1. Logic Gates¶

In Section Logic Gates, we introduce logic gates with one and two inputs. Often, we need gates with more than 2 inputs, or wish to design new logic gates for specific logic functions or timing behavior. In the following we study the design and characterization of logic gates as elementary building blocks for digital circuits.

4.1.1. Logic Gates with Multiple Inputs¶

Assume we design a digital circuit and need a NAND gate with 3 inputs. We may assemble the 3-input NAND gate using 2-input NAND gates and an inverter as building blocks, see Figure 4.1. Using Boolean algebra, it is straightforward to show that this circuit implements the logic function \(Y = \overline{A\,B\,C}.\) There are several problems with this implementation, though. First, the delay from inputs \(A\) and \(B\) to \(Y\) is larger than from input \(C.\) Such asymmetries can be confusing when designing larger circuits with delay constraints. Second, the delay of the longest path is larger than necessary, and, third, the CMOS implementation needs 10 transistors, which is also more than necessary.

Figure 4.1: 3-input NAND gate built from 2-input NAND gates.

An alternative design for the 3-input NAND gate uses CMOS transistors as building blocks, as shown in Figure 4.2. This circuit needs only 6 transistors, and is symmetric w.r.t. its inputs. Output \(Y\) is 0 only if all inputs are 1.

A = 0 B = 0 C = 0 Figure 4.2: CMOS circuit of 3-input NAND gate and interactive switch model.

The circuit topology in Figure 4.2 extends to \(n\)-input NAND gates for \(n \ge 2\): compose \(n\) pMOS transistors in parallel and \(n\) nMOS transistors in series.[1] The series composition of the nMOS transistors determines the on-resistance of the pull-down path. The larger number of inputs, the smaller the pull-down current and, hence, the larger the delay of the gate. The model of logical effort reflects this dependence of delay on the number of inputs in the logical effort of the gate. The matched \(n\)-input NAND gate increases the width of the nMOS transistors to match the drive current of the reference inverter. Increasing the transistor widths reduces the on-resistance at the expense of increasing the input capacitance of the gate. This transistor sizing does not solve the problem that the delay increases with increasing numbers of inputs. It merely shifts the burden from the gate itself to its driver.

To determine the logical effort and parasitic delay of the 3-input NAND gate, we size the transistors to match the output current of the reference inverter. Figure 4.3 shows the matched 3-input NAND gate with transistor widths. When all inputs are 1, all three nMOS transistors are switched on. The series of on-resistances sums to a total of \(3 R_n(nand).\) To be equal to on-resistance \(R_n(inv)\) of the reference inverter, we need to triple the widths of the nMOS transistors of the NAND gate, \(W_n(nand) = 3 W_n(inv) = 3.\) If one of the NAND inputs is 0, then one pMOS transistor is switched on. To match the on-resistanace of the reference inverter, we assign the same width \(W_p(nand) = W_p(inv) = 2\) to the pMOS transistors of the NAND gate. If more than one NAND input is 0, then the parallel composition of pMOS transistors has a lower on-resistance than for a single input. The corresponding pull-up delay is smaller if only one input is 0. Therefore, using normalized pMOS widths of 2 units matches the reference inverter in the worst case when the pull-up delay of the NAND gate is largest.

Figure 4.3: Matched 3-input NAND gate.

The matched 3-input NAND gate enables us to determine the logical effort of a 3-input NAND gate by means of the normalized input capacitances. Each input, e.g. input \(A,\) drives two parallel gate capacitances of one pMOS and one nMOS transistor:

\[C_{in}(A) = W_p(A) + W_n(A) = 2 + 3 = 5\,.\]

Therefore, the logical effort of the 3-input NAND gate is

\[g_{nand3} = \frac{C_{in}(A)}{C_{in}(inv)} = \frac{5}{3}\]

per input. Note that the logical effort of the 3-input NAND gate is by \(1/3\) larger than the logical effort of the 2-input NAND gate. Furthermore, we determine the parasitic delay of the 3-input NAND gate by calculating the normalized output capacitance:

\[C_{out}(nand3) = W_p(A) + W_p(B) + W_p(C) + W_n(A) = 3 \cdot 2 + 3 = 9\,.\]

Thus, the parasitic delay of a 3-input NAND gate is

\[p_{nand3} = \frac{C_{out}(nand3)}{C_{out}(inv)} = \frac{9}{3} = 3\,.\]

We find that the parasitic delay of the 3-input NAND gate is by one delay unit larger than that of the 2-input NAND gate.

For NAND gates with \(n\) inputs, the matched gate has nMOS transistors of width \(n\) and pMOS transistors of width 2. Therefore, the logical effort of the \(n\)-input NAND gate is

(1)\[g_{nandn} = \frac{n + 2}{3}\]

per input, and has parasitic delay

(2)\[p_{nandn} = \frac{n \cdot 2 + n}{3} = n\,.\]

The larger the number of inputs \(n,\) the larger the logical effort and the parasitic delay of the NAND gate. As a consequence, the more inputs the gate has, the slower it is, or, from the circuit designer perspective, the more effort we need to invest in the driver circuit to keep the path delay low.

The design of an \(n\)-input NOR gate procedes analogously to the \(n\)-input NAND gate. The \(n\)-input NOR gate has \(n\) parallel nMOS transistors and \(n\) pMOS transistors in series. The matched \(n\)-input NOR gate has nMOS transistors of width \(W_n = 1\) and pMOS transistors of width \(W_p = 2n.\) As a result each input of the \(n\)-input NOR gate has logical effort

\[g_{norn} = \frac{2n + 1}{3}\,,\]

and the parasitic delay of the gate is

\[p_{norn} = \frac{n \cdot 1 + 2n}{3} = n\,.\]

For example, a 3-input NOR gate has logical effort \(g_{nor3} = 7/3\) per input and parasitic delay \(p_{nor3} = 3.\) The logical effort of an \(n\)-input NOR gate is larger than the logical effort of an \(n\)-input NAND gate. Thus, if we have the choice to implement the logic of a circuit path with NAND gates or NOR gates, we commonly prefer NAND gates, because they tend to yield lower path delays.

4.1.2. Tree-Structured Logic Gates¶

When the number of inputs of logic gates is large, tree-structured gates offer superior performance compared to CMOS gates. As a concrete example, consider a 16-input NAND gate. The logical effort of a CMOS NAND gate with a pull-down network of 16 nMOS transistors in series is \(g_{nand16} = 6\) according to Equation (1), and the parasitic delay is \(p_{nand16} = 16\) according to Equation (2). Figure 4.4 shows an alternative, tree-structured implementation that uses 2-input NAND gates and inverters as building blocks. The tree has a path logical effort of \(G = g_{nand2}^4 g_{inv}^3 = 3.16,\) and a path parasitic delay of \(P = 4 p_{nand2} + 3 p_{inv} = 11.\) The tree structure improves both quantities.

Figure 4.4: A tree-structured 16-input NAND gate using 2-input NAND gates and inverters as building blocks.

The tree structure of the NAND circuit is inspired by Boolean algebra, more succinctly the associativity of the AND operation:

\[(A \cdot B) \cdot C = A \cdot (B \cdot C)\,.\]

Associativity implies that we can parenthesize an expression any way we want to. We may even omit the parentheses altogether because, mathematically, the parenthesization does not affect the value of the expression. From the perspective of the circuit designer, however, parentheses express an evaluation order. For example, given a binary AND operation and 8 operands \(A_0, A_1, \ldots, A_7,\) there are many different evaluation orders. One of them corresponds to the left-skewed parenthesization

\[(((((((A_0 \cdot A_1) \cdot A_2) \cdot A_3) \cdot A_4) \cdot A_5) \cdot A_6) \cdot A_7)\,,\]

which groups the operations such that all 7 AND operations must be executed one after each other. The corresponding circuit has a depth of 7 operations. In contrast, the tree-structured parenthesization yields the minimum depth of \(\log_2 8 = 3\) operations:

\[(((A_0 \cdot A_1) \cdot (A_2 \cdot A_3)) \cdot ((A_4 \cdot A_5) \cdot (A_6 \cdot A_7)))\,.\]

Figure 4.5 shows the corresponding circuits. We note that the tree structure is asymptotically optimal, i.e. for a large enough number of inputs \(n,\) no other circuit topology has a depth smaller than \(\log_2 n,\) up to constant factors. The constants depend on the number of inputs of the building blocks. For example, if we replace subtrees of three 2-input AND gates with one 4-input AND gate, the depth of the tree-structured AND gate in Figure 4.5 decreases from 3 to 2 levels.

Figure 4.5: Left-skewed (a) and tree-structured (b) conjunction of 8 inputs with 7 binary AND operators.

Given a tree-structured \(n\)-input AND gate, we obtain an \(n\)-input NAND gate by complementing the output. The 16-input NAND gate in Figure 4.4 implements each 2-input AND gate of the tree with a 2-input CMOS NAND circuit followed by an inverter. To complement the output, we omit the inverter at the root of the tree. The 3-input NAND gate in Figure 4.1 is an example of an unbalanced NAND tree, where the number of inputs is not a power of 2. In contrast, the trees in Figure 4.4 and Figure 4.5 are balanced, because each path from input to output has the same depth, or number of gates.

4.1.3. Asymmetric Gates¶

An asymmetric gate has inputs with different logical efforts. Asymmetric gates require additional attention from the circuit designer, but facilitate faster circuits if used appropriately. The asymmetry may be caused by the circuit topology or by deliberate transistor sizing to reduce the delay of the critical path. We design a majority gate to motivate asymmetric circuit topologies, and then discuss how to trade logical effort between inputs.

Majority Gate¶

The majority operation is a ternary Boolean operation, cf. majority function:

(3)\[M(A,B,C) = A B + A C + B C\,.\]

In terms of a truth table the majority operation is specified as:

A B C M

0 0 0 0

0 0 1 0

0 1 0 0

0 1 1 1

1 0 0 0

1 0 1 1

1 1 0 1

1 1 1 1

A	B	C	M
0	0	0	0
0	0	1	0
0	1	0	0
0	1	1	1
1	0	0	0
1	0	1	1
1	1	0	1
1	1	1	1

As the name suggests, the output of the majority operation is the value that occurs more often in inputs \(A,\) \(B,\) and \(C.\) The inverting 3-input majority gate in Figure 4.6 computes the complement \(Y = \overline{M}.\)

A = 0 B = 0 C = 0 Figure 4.6: Symmetric CMOS circuit of inverting 3-input majority gate and interactive switch model.

We characterize the delay of the majority gate by deriving its logical effort and parasitic delay. The majority gate has six arms, each of which is a series composition of two transistors. To match the on-resistances of the reference inverter, we make the transistors in each arm of the majority gate twice as wide as those of the reference inverter. Thus, as shown in Figure 4.7, all pMOS transistors have width \(W_p = 4\) and all nMOS transistors have width \(W_n = 2.\) These transistor widths match the reference inverter for all combinations of input values, except when all input values are equal. In this case all three arms of either pull-up or pull-down network are switched on, such that the parallel composition yields a three times smaller equivalent on-resistance. We classify these input combinations as exceptionally fast, analogous to the exceptional input combinations of the NAND or NOR gates.

Figure 4.7: Matched inverting majority gate.

Each input of the majority gate drives two pMOS and two nMOS transistors. Therefore, all three inputs \(A,\) \(B,\) and \(C\) have logical effort

\[g_M(A) = g_M(B) = g_M(C) = \frac{2 W_p + 2 W_n}{C_{in}(inv)} = 4\,.\]

The normalized output capacitance of the majority gate consists of three pMOS and three nMOS transistors. Therefore, the parasitic delay of the majority gate is

\[p_{M} = \frac{3 W_p + 3 W_n}{C_{out}(inv)} = 6\,.\]

Compared to a NAND or NOR gate, the majority gate is relatively slow. One option for reducing the logical effort and parasitic delay of the majority gate is transistor sharing. Observe in Figure 4.7 that the two pMOS transistors that one input drives in the pull-up network are in distinct arms. The pull-down network exhibits the same structure only with nMOS transistors. If two arms share one transistor, we can reduce the logical effort by reducing the number of transistors from two to one. This does not work for all inputs simultaneously. As shown in Figure 4.8, if we share the transistors of input \(A,\) we cannot also share the transistors of inputs \(B\) and \(C\) across arms without altering the logic function. However, sharing the transistors of input \(A\) reduces its logical effort, and also the parasitic delay of the entire gate.

A = 0 B = 0 C = 0 Figure 4.8: Asymmetric inverting 3-input majority gate with transistor sharing and interactive switch model.

Sharing the transistors of input \(A,\) as shown in Figure 4.8, does not affect the transistor widths of the matched majority gate. A path from supply rail to output consists of a series composition of two transistors, either in the pull-up or pull-down network, just like in Figure 4.7. The exceptional case, when all three input values are equal, has a larger equivalent on-resistance than the gate in Figure 4.7, but is still faster than the regular case. Thus, the matched version of the majority gate in Figure 4.8 has transistor widths \(W_p = 4\) and \(W_n = 2.\) While the logical efforts of inputs \(B\) and \(C\) remain unchanged, transistor sharing halves the logical effort of input \(A\):

\[\begin{eqnarray*} g_{Masym}(A) &=& \frac{W_p + W_n}{C_{in}(inv)}\ =\ 2\,, \\ g_{Masym}(B) = g_{Masym}(C) &=& \frac{2 W_p + 2 W_n}{C_{in}(inv)}\ =\ 4\,. \end{eqnarray*}\]

Since the logical effort of input \(A\) differs from the logical effort of inputs \(B\) and \(C,\) the majority gate in Figure 4.8 is an asymmetric gate. In contrast, the majority gate in Figure 4.6 is a symmetric gate, because all inputs have equal logical effort. Rather than characterizing an asymmetric gate by listing the logical efforts for all inputs, it is sometimes sufficient to use a single number instead. The total logical effort is the sum of the logical efforts of all inputs. For example, the total logical effort permits a quantitative comparison of the symmetric and asymmetric majority gates. The symmetric majority gate has total logical effort \(g_{M} = g_{M}(A) + g_{M}(B) + g_{M}(C) = 3 \cdot 4 = 12.\) Transistor sharing reduces the total logical effort of the asymmetric gate to \(g_{Masym} = 2 + 2 \cdot 4 = 10.\)

Sharing the transistors of input \(A\) reduces not only the logical effort but also the output capacitance and, thus, the parasitic delay of the asymmetric majority gate:

\[p_{Masym} = \frac{2 W_p + 2 W_n}{C_{out}(inv)} = 4\,.\]

Note that we can construct alternative gate designs by sharing the transistors of input \(B\) or of input \(C.\) The resulting asymmetric gates do not reduce the output capacitance, however. Only sharing the transistors driven by input \(A\) reduces the parasitic delay. Thus, if we have the choice, we prefer sharing those transistors connected to the output of the gate in order to reduce the parasitic delay as added benefit.

Asymmetric Transistor Sizing¶

The asymmetric majority gate sacrifices the symmetry of the circuit topology for a reduced logical effort of one input without affecting the logical effort of the other inputs. The benefit of the reduced logical effort is a reduced gate delay for signal transitions on the corresponding input. Even with a symmetric gate topology, inherent to NAND and NOR gates for instance, we can trade logical effort between inputs by transistor sizing. The resulting asymmetry may be desirable to speed up the critical path of a circuit.

As an example, consider the 2-input NAND gate in Figure 4.9. Assume that input \(A\) is critical, and we wish to reduce its delay. Thus, our goal is to minimize the logical effort of input \(A,\) which means to reduce \(g_{nand2}(A)\) from 4/3 to that of a reference inverter \(g_{inv} = 1.\)

Figure 4.9: Matched asymmetric NAND gate with nMOS scaling fraction \(\alpha.\)

Since the matched NAND gate has normalized nMOS transistor widths \(W_n(A) = W_n(B) = 2,\) reducing width \(W_n(A)\) to 1 reduces the input capacitance of input \(A\) as desired. However, if all we do is reduce the input capacitance of input \(A,\) the gate becomes mismatched w.r.t. the reference inverter. In fact, halving the width of the nMOS transistor doubles its on-resistance, which results in a smaller output current. Consequently, the gate would be slower rather than faster as planned. To obtain a delay reduction, we need to reduce the logical effort, i.e. we wish to reduce the input capacitance of input \(A\) without changing the equivalent on-resistance of the pull-down network. Transistor sharing in the majority gate has exactly this effect.

In case of the 2-input NAND gate, we can balance a change of width \(W_n(A)\) by a corresponding change of \(W_n(B)\) so that the pull-down network of the asymmetric NAND gate remains matched to the reference inverter. More specifically, matching an asymmetric 2-input NAND gate requires that equivalent on-resistance \(R_n(A) + R_n(B)\) of the series composition of nMOS transistors must be equal to on-resistance \(R_n(inv)\) of the reference inverter:

\[R_n(A) + R_n(B) = R_n(inv)\,.\]

Since the on-resistance is indirectly proportional to the transistor width, we obtain this matching constraint for the asymmetric NAND gate:

\[\begin{split}\begin{eqnarray*} \frac{1}{W_n(A)} + \frac{1}{W_n(B)} &=& \frac{1}{W_n(inv)} \\ \Leftrightarrow\qquad W_n(A) + W_n(B) &=& W_n(A) W_n(B)\,, \end{eqnarray*}\end{split}\]

because the reference inverter has an nMOS transistor of width \(W_n(inv) = 1.\) We can fulfill this constraint with a scaling fraction \(\alpha,\) \(0 < \alpha < 1,\) if we choose \(W_n(A) = 1/(1-\alpha)\) and \(W_n(B) = 1/\alpha,\) as shown in Figure 4.9. For \(\alpha = 1/2,\) we obtain the symmetric matched NAND gate with \(W_n(A) = W_n(B) = 2.\) Scaling fraction \(\alpha < 1/2\) reduces the logical effort of input \(A.\) As \(\alpha\) approaches 0, width \(W_n(A)\) and logical effort \(g_{nand2asym}(A)\) approach value 1. The table below lists the relevant quantities for several values of \(\alpha.\)

\(\alpha\)	\(W_n(A)\)	\(W_n(B)\)	\(g_{nand2asym}(A)\)	\(g_{nand2asym}(B)\)	\(g_{nand2asym}\)
1/2	2	2	4/3	4/3	2.67
1/4	4/3	4	10/9	2	3.11
1/10	10/9	10	28/27	4	11.04
1/100	100/99	100	298/297	34	101.0

The price of reducing the logical effort of input \(A\) is a rapidly growing logical effort of input \(B.\) Choosing \(\alpha = 1/4\) appears to be a good compromise, where \(g_{nand2asym}(A) = 1.1\) is within \(10\,\%\) of the minimum logical effort, yet the total logical effort of the gate increases by \(16\,\%\) only.

4.1.4. Compound Gates¶

There exist complex Boolean expressions that we can implement in a single CMOS gate. In particular, all Boolean expressions formed with AND and OR operations can be cast into a CMOS gate if all variables appear in uncomplemented form only and the entire AND-OR expression is complemented.

Two widely used families of compound gates are the inverting AND-OR (AOI) gates and the inverting OR-AND (OAI) gates. As an example, consider the Boolean expression

\[Y = \overline{A \cdot B + C}\,.\]

If our CMOS gate supply consists of NAND, NOR, and NOT gates, we might be tempted to use the involution theorem, and transform the expression such that we can implement \(Y\) with these gates:

\[Y = \overline{A \cdot B + C} = \overline{\overline{\overline{(A \cdot B)}} + C}\,.\]

The circuit for \(Y\) is shown in Figure 4.10. The logical effort of the longest paths \(A \rightarrow Y\) and \(B \rightarrow Y\) is \(G = g_{nand} g_{inv} g_{nor} = 20/9\) and the corresponding path parasitic delay is \(P = p_{nand} + p_{inv} + p_{nor} = 5.\)

Figure 4.10: Implementation of \(Y = \overline{A \cdot B + C}\) with NAND, NOR, and NOT gates.

Now, recall the completeness theorem, which tells us that any Boolean expression of AND and OR operations can be realized with a series-parallel switch network. In our example, \(Y=0\) if and only if \(A=1\) AND \(B=1,\) OR if \(C=1.\) This logical equivalence translates directly into the CMOS pull-down network in Figure 4.11(a). We implement the conjunction of \(A\) and \(B\) with a series composition of nMOS transistors, and the disjunction of \(AB\) and \(C\) with a parallel composition. With the pull-down network in place, we derive the pull-up network in Figure 4.11(b) systematically by applying the principle of duality. The pull-up network uses pMOS instead of nMOS transistors, and we replace the parallel composition with a series composition and vice versa.

Figure 4.11: CMOS gate for \(Y = \overline{A \cdot B + C}\): (a) pull-down network, (b) pull-up network is dual of pull-down network, (c) matched CMOS gate.

Figure 4.11(c) redraws the pull-up network slightly for the sake of clarity, and annotates the transistor sizes needed to match the reference inverter. The gate is asymmetric, because the logical efforts of the inputs differ:

\[g(A) = g(B) = 2\,,\qquad g(C) = 5/3\,.\]

The parasitic delay of the gate is \(p = 7/3.\) Compared to the logically equivalent circuit in Figure 4.10, the CMOS gate offers improvements in terms of speed and number of transistors. To emphasize the fact that our CMOS gate is a compound gate, we introduce the symbol in Figure 4.12 by fusing an AND gate with a NOR gate. Since the output of the AND gate drives the input of the OR gate whose output is complemented by an inverter, the compound gate is an inverting AND-OR gate or AOI gate for short.

Figure 4.12: The AOI gate symbol reflects the logic function.

AOI gates exist for the complement of all Boolean expressions in SOP normal form.

Analogous to AOI gates, the family of OAI gates, or inverting OR-AND gates, consists of compound gates for Boolean expressions in complemented POS normal form. As an example, consider Boolean expression

\[Y = \overline{(A + B) \cdot (C + D)}\,.\]

The OAI gate for \(Y\) and the corresponding OAI symbol are shown in Figure 4.13. The matched gate is symmetric, with logical effort \(g=2\) per input and parasitic delay \(p=4.\)

Figure 4.13: Matched OAI gate for \(Y = \overline{(A + B) \cdot (C + D)}\) and corresponding OAI gate symbol.

Besides the AOI and OAI gates for complemented two-level normal forms, we can design compound gates for multilevel expressions as well. For example, Boolean expression

\[Y = \overline{(A + BC) \cdot D}\]

has a pull-down network consisting of a series composition of \(D\) and a parallel composition of \(A\) and a series composition of \(B\) and \(C.\)

Figure 4.14: Matched multilevel compound gate for \(Y = \overline{(A + BC) \cdot D}\) and corresponding compound gate symbol.

The compound gate in Figure 4.14 is asymmetric, because the logical efforts of the inputs differ:

\[g(A) = 2\,,\qquad g(B) = g(C) = 8/3\,,\qquad g(D) = 4/3\,.\]

The parasitic delay of the compound gate is \(p = 8/3.\)

CMOS designers often use compound gates as efficient implementation target. If one of the inputs must be complemented, then it is easy to introduce a separate inverter stage to drive the input of the compound gate. If we want an AO or OA function without inverted output, we may use an additional inverter stage to complement the output of the corresponding AOI or OAI gate.

4.1

Design a CMOS compound gate to compute

\(\qquad{\displaystyle Y = \overline{(A + B + C) \cdot D}\,.}\)

Derive either the pull-down or the pull-up network.
Deduce the complementary network using the principle of duality.
Perform transistor sizing to match the compound gate.
Determine the logical effort(s) and parasitic delay.

We derive the pull-down network by noticing in the Boolean expression that \(Y = 0\) if \((A+B+C)\cdot D = 1.\) Therefore, we design a series-parallel pull-down network that connects output \(Y\) to GND when the network is switched on. Boolean subexpression \(A+B+C\) translates into a parallel composition of nMOS transistors, and the conjunction with \(D\) into a series composition. The Figure below shows the pull-down network in part (a).
We derive the pull-up network from the pull-down network using the principle of duality. For each nMOS transistor we introduce the corresponding pMOS transistor. The parallel composition of the nMOS transistors controlled by inputs \(A,\) \(B,\) and \(C\) becomes a series composition of pMOS transistors. Furthermore, the series composition of nMOS transistor \(D\) with the parallel composition turns into a parallel composition of pMOS transistor \(D\) with the series composition of pMOS transistors. The pull-up network is shown as part of the complete CMOS compound gate in part (b) of the Figure above.
The transistor sizes for the matched compound gate are chosen to have the same on-resistances as the reference inverter, see part (c) of the Figure above. We start with the pull-down network. The reference inverter has an nMOS transistor of width \(W_n(inv) = 1.\) Our compound gate has three paths to switch on the pull-down network, through nMOS transistor \(D\) and one of nMOS transistors \(A,\) \(B,\) or \(C.\) The largest on-resistance appears if exactly one of the paths is switched on. This path consists of a series composition of two nMOS transistors. Since the on-resistance of two resistors in series add, we want each on-resistance to be half that of the reference inverter. We achive this goal by making the nMOS transistors twice as wide, i.e. we choose widths \(W_n = 2 \cdot W_n(inv) = 2\) for all nMOS transistors. If more than two paths through the pull-down network are switched on, the on-resistance will be smaller than that of the reference inverter, which we consider as an exceptional case that is faster than the worst case.

The pull-up network must have the same on-resistance as the pMOS transistor of the reference inverter of width \(W_p(inv) = 2.\) The pull-up network has two paths. In the worst case, only one of these paths is switched on. To match the path through pMOS transistor \(D,\) we assign width \(W_p(D) = W_p(inv) = 2.\) The other path consists of a series composition of three pMOS transistors. If each of these transistors has one third of the on-resistance of the pMOS transistor of the reference inverter, the equivalent on-resistance of the path equals that of the reference inverter. Thus, we triple the widths \(W_p(A) = W_p(B) = W_p(C) = 3 \cdot W_p(inv) = 6.\) In the exceptional case of input combination \(ABCD=0000,\) both pull-up paths are switched on, the equivalent on-resistance of the parallel paths is smaller than that of the reference inverter, and the compound gate will be faster.
The logical effort of a compound gate input is its input capacitance divided by the input capacitance of the reference inverter \(C_{in}(inv) = 3.\) Consider input \(A\) of our compound gate. Input capacitance \(C_{in}(A)\) is the sum of the normalized transistor widths input \(A\) controls, here \(W_p(A) = 6\) and \(W_n(A) = 2.\) Thus, we find \(C_{in}(A) = 6 + 2 = 8.\) Inputs \(B\) and \(C\) have the same input capacitance, whereas input \(D\) has input capacitance \(C_{in}(D) = 2 + 2 = 4.\) The compound gate is asymmetric with logical efforts:

\[g(A) = g(B) = g(C) = \frac{8}{3}\,,\quad g(D) = \frac{4}{3}\,.\]

The parasitic delay of the compound gate is the ratio of its output capacitance and the output capacitance of the reference inverter \(C_{out}(inv) = 3.\) The output capacitance of the compound gate is the sum of the normalized widths of all transistors connected to output \(Y,\) here \(C_{out} = W_p(C) + W_p(D) + W_n(D) = 6 + 2 + 2 = 10.\) Therefore, the parasitic delay of our compound gate is

\[p = \frac{10}{3}\,.\]

4.1.5. XOR and XNOR Gates¶

Our CMOS implementation of the XOR gate in Figure 1.42 deserves a closer look. Unlike NAND or NOR gates, the XOR gate assumes that the inputs are available in complemented and uncomplemented form, and the pull-up and pull-down networks are not duals of each other. First, however, we characterize the delay of the 2-input XOR gate by deriving its logical effort and parasitic delay. The matched 2-input XOR gate is shown in Figure 4.15. Each arm is a series composition of two transistors. To match the on-resistances of the reference inverter, we make the transistors in each arm of the XOR gate twice as wide.

Figure 4.15: Matched 2-input XOR gate.

Each of the four inputs of the 2-input XOR gate, \(A,\) \(\overline{A},\) \(B,\) and \(\overline{B}\) drives one pMOS and one nMOS transistor. Therefore, each input has logical effort

\[g_{xor2} = \frac{4 + 2}{3} = 2\,.\]

The normalized output capacitance of the XOR gate consists of two pMOS and two nMOS transistors, so that the parasitic delay is

\[p_{xor2} = \frac{4 + 4 + 2 + 2}{3} = 4\,.\]

Compared to the 2-input NAND and NOR gates, the XOR gate uses twice as many transistors, whose capacitances make it relatively slow.

Symmetric XOR Gates¶

XOR gates with \(n\)-inputs implement the odd-parity operation, which outputs a 1 if the number of inputs with value 1 is odd, cf. ref:parity function <parity-function>. For example, for \(n = 3,\) function \(Y = A \oplus B \oplus C\) equals 1 if the number of 1-inputs is 1 or 3. The truth table shows the 3-input XOR function or odd-parity function.

A B C Y

0 0 0 0

0 0 1 1

0 1 0 1

0 1 1 0

1 0 0 1

1 0 1 0

1 1 0 0

1 1 1 1

A	B	C	Y
0	0	0	0
0	0	1	1
0	1	0	1
0	1	1	0
1	0	0	1
1	0	1	0
1	1	0	0
1	1	1	1

The odd-parity property suggests a design for an XOR CMOS circuit. To build an \(n\)-input XOR gate, we need \(2^n\) arms with \(n\) transistors each. Half of the arms are in the pull-up network, one per minterm, and each has an odd number of complemented inputs. The other half of the arms are in the pull-down network. For each combination of input values, exactly one arm is switched on. Figure 4.16 shows the CMOS circuit for a 3-input XOR gate. Whenever you toggle one of the inputs output \(Y\) toggles, because changing one input changes the parity from odd to even or vice versa.

A = 0 B = 0 C = 0 Figure 4.16: CMOS circuit of 3-input XOR gate and interactive switch model.

The matched 3-input XOR gate has pMOS transistors of width \(W_p = 6\) and nMOS transistors of width \(W_n = 3.\) Each of the six inputs \(A,\) \(\overline{A},\) \(B,\) \(\overline{B},\) \(C,\) and \(\overline{C}\) drives two pMOS and two nMOS transistors, resulting in a logical effort of \(g_{xor3} = 6\) per input. The output is connected to four pMOS and four nMOS transistors, that contribute to the output capacitance and parasitic delay \(p_{xor3} = 12.\) Since the gate topology generalizes to \(n\) inputs, we find that the \(n\)-input XOR gate has logical effort

\[g_{xorn} = \frac{2^{n-2} \cdot 2n + 2^{n-2} \cdot n}{3} = n 2^{n-2}\]

per input. The parasitic delay of the \(n\)-input XOR gate amounts to

\[p_{xorn} = \frac{2^{n-1} \cdot 2n + 2^{n-1} \cdot n}{3} = n 2^{n-1}\,.\]

The exponential growth of both logical effort and parasitic delay in the number of inputs \(n\) limits the applicability of this CMOS circuit to small \(n.\) For larger numbers of inputs tree-structured topologies based on 2-input XOR gates as building blocks are faster.

Asymmetric XOR Gates¶

We can reduce the logical effort and parasitic delay of the XOR gate by transistor sharing. Notice in Figure 4.16 that each input drives two pMOS and two nMOS transistors. If two arms share one transistor, we can reduce the logical effort of the corresponding input. Sharing does not work for all inputs simultaneously. As shown in Figure 4.17 we share the transistors of inputs \(A\) and \(\overline{A}\) to reduce their logical efforts and the parasitic delay of the entire gate. Then, we can also share the transistors of inputs \(C\) and \(\overline{C},\) but not those of inputs \(B\) and \(\overline{B}\) without altering the logic function.

A = 0 B = 0 C = 0 Figure 4.17: Asymmetric CMOS circuit of 3-input XOR gate and interactive switch model.

To match the XOR gate, observe that any path from supply to output consists of a series composition of 3 transistors, just like an arm in Figure 4.16. Therefore, sharing transistors does not affect the transistor widths of the matched gate. Like the symmetric XOR gate, all pMOS transistors of the matched gate have width \(W_p = 6\) and all nMOS transistors width \(W_n = 3.\) As a result, the XOR gate in Figure 4.17 is asymmetric. The logical efforts of inputs \(A,\) \(\overline{A},\) \(C\) and \(\overline{C}\) are

\[g_{xor3asym}(A) = g_{xor3asym}(\overline{A}) = g_{xor3asym}(C) = g_{xor3asym}(\overline{C}) = \frac{W_p + W_n}{3} = 3\,,\]

while the logical efforts of inputs \(B\) and \(\overline{B}\) remain unchanged compared to the symmetric design:

\[g_{xor3asym}(B) = g_{xor3asym}(\overline{B}) = \frac{2 Wp + 2 W_n}{3} = 6\,.\]

Sharing the transistors connected to the output of the gate halves the output capacitance, and therefore, the parasitic delay of the asymmetric XOR gate is

\[p_{xor3asym} = \frac{2 W_p + 2 W_n}{3} = 6\,.\]

The asymmetric XOR gate has smaller total logical effort and smaller parasitic delay than the symmetric gate.

XOR Gates and Duality¶

The symmetric 2-input XOR gate in Figure 4.15 violates the principle of duality, because the pull-up and pull-down networks are not duals of each other. In this section, we use the 2-input XOR gate to demonstrate that duality is a sufficient but not a necessary property of CMOS gates.

Recall the truth table of the 2-input XOR gate, \(Y = A \oplus B\):

A B Y

0 0 0

0 1 1

1 0 1

1 1 0

A	B	Y
0	0	0
0	1	1
1	0	1
1	1	0

The SOP normal form for the XOR function is

\[Y = \overline{A} B + A \overline{B}\,.\]

Interpret this equality as \(Y=1\) if \(A=0\) AND \(B=1\) OR if \(A=1\) AND \(B=0.\) Negate predicates \(B=1\) and \(A=1\) to \(\overline{B}=0\) and \(\overline{A}=0,\) to simplify the argument for the switch model of pMOS transistors. Then, we obtain the equivalent XOR logic:

\(Y=1\) if (\(A=0\) AND \(\overline{B}=0\)) OR if (\(\overline{A}=0\) AND \(B=0\)).

The Boolean expression consists of AND and OR operations, which translates directly into the series-parallel pull-up network in Figure 4.18(a). Given the pull-up network, we can derive the pull-down network by forming the dual, as shown in Figure 4.18(b). We transform the parallel arms of the pull-up network into a series composition of parallel nMOS transistors. The interactive switch model enables you to verify the truth table of the XOR gate.

A = 0 B = 0 Figure 4.18: 2-input XOR gate: (a) pull-up network derived from logic function, (b) the pull-down network is the dual of the pull-up network, and (c) interactive switch model.

The XOR gate in Figure 4.18 has the same pull-up network as our original XOR gate in Figure 4.15, up to the order of the pMOS gate inputs. However, the pull-down networks are different. To arrive at the XOR gate in Figure 4.15, consider the truth table of the XOR gate again, now with the goal to derive the pull-down network from the logic function. We see that \(Y=0\) if \(A=0\) AND \(B=0\) OR if \(A=1\) AND \(B=1.\) To fit the switch model of nMOS transistors, negate the predicates \(A=0\) and \(B=0\) to \(\overline{A}=1\) and \(\overline{B}=1,\) and we find the XOR logic:

\(Y=0\) if (\(\overline{A}=1\) AND \(\overline{B}=1\)) OR if (\(A=1\) AND \(B=1\)).

This Boolean expression translates directly into the series-parallel pull-down network in Figure 4.19(a). Deriving the dual of the pull-down network yields the pull-up network in Figure 4.19(b).

A = 0 B = 0 Figure 4.19: 2-input XOR gate: (a) pull-down network derived from logic function, (b) the pull-up network is the dual of the pull-down network, and (c) interactive switch model.

The XOR gate in Figure 4.19 has the same pull-down network as our original XOR gate in Figure 4.15, but a different pull-up network. However, we arrive at the XOR gate design in Figure 4.15 by combining the pull-up network of Figure 4.18 with the pull-down network of Figure 4.19. This design saves a wire and is simpler than the designs with dual networks, because the four arms are independent. Furthermore, the XOR gate demonstrates that CMOS gates do not necessarily have pull-up and pull-down networks that are duals of each other. The 3-input XOR gates and the majority gate are examples of CMOS gates whose pull-up and pull-down networks are not duals either.

XOR CMOS Circuits¶

The 2-input XOR gate in Figure 4.15 requires complemented and uncomplemented inputs. In fact, we may specify this CMOS gate as a logic function of four inputs, XOR\((A, B, C, D),\) with the additional constraints \(C = \overline{A}\) and \(D = \overline{B},\) that the driving circuit has to obey. The input specification distinguishes the XOR gate from other CMOS gates, the NAND and NOR gates for example. The 2-input NAND and NOR gates have two inputs not only because they realize binary logic functions but also in terms of the number of input wires that their CMOS implementations have. In contrast, we define the XOR gate as a binary logic function, draw the XOR gate symbol with two inputs, but our CMOS gate in Figure 4.15 requires four inputs. The reason is that there exists no 2-input CMOS gate for the 2-input XOR logic function, because CMOS gates can implement monotonically decreasing functions only. Hence, implementing a 2-input XOR logic gate requires a CMOS circuit rather than a CMOS gate. In this section, we discuss several CMOS implementations of the 2-input XOR logic gate.

We begin with a straightforward extension of the CMOS gate in Figure 4.15. To obtain a 2-input XOR gate, we generate the complements of inputs \(A\) and \(B\) by means of inverters. The resulting implementation of the XOR logic gate is shown in Figure 4.20.

Figure 4.20: 2-input XOR gate with input inverters.

If we wish to use the 2-input XOR gate in Figure 4.20 as a building block for the construction of larger circuits, we need to characterize its delay as a function of the electrical effort. Since the XOR logic gate consists of inverters and an XOR CMOS gate, we can scale the inverters and the XOR CMOS gate independently to minimize its delay. Since the XOR circuit is a reconverging branch, we could apply the 2d-analysis method to determine the scale factors for a given electrical effort. However, a close look reveals why a 2d-analysis may not yield the fastest design. First, notice that the circuit is symmetric in inputs \(A\) and \(B,\) if we use a symmetric XOR CMOS gate. Thus, the circuit has one path of interest, say from input \(A\) to output \(Y.\) Assume the inverter has input capacitance \(C_1.\) The XOR CMOS gate imposes a load of one pMOS and one nMOS gate capacitance on each leg of a fork driven by input \(A.\) Let \(C_2\) be the input capacitance of inputs \(A\) and \(\overline{A}\) of the XOR CMOS gate, then our XOR circuit includes a 1-fork with equal loads, as shown in Figure 4.21.

Figure 4.21: The path of interest of 2-input XOR gate in Figure 4.20 includes a 1-fork with equal loads.

Recall that we should avoid 1-forks, but let us ignore this lesson for a moment. Instead, we insist on minimizing the delay of the path from input to output in Figure 4.21 given electrical effort \(H.\) Notice that the circuit does not permit equalizing the delays of the legs of the fork. Instead, the path through the inverter leg adds the inverter delay to the delay through the XOR CMOS gate. Thus, all we can do by gate sizing is to minimize the delay of the slower leg with the inverter. This leaves us with the optimization problem of how much input capacitance to assign to the on-path inverter versus the off-path XOR CMOS gate. As discussed in fork design, we introduce factor \(\alpha\) in range \(0 < \alpha < 1\) to split the input capacitance between the inverter and the XOR CMOS gate:

\[C_1 = \alpha C_{in}\,,\qquad C_2 = (1 - \alpha) C_{in}\,,\]

such that \(C_1 + C_2 = C_{in}.\) Then, the stage efforts of the inverter and the XOR CMOS gate are:

\[f_{inv} = g_{inv} h_{inv} = g_{inv}\,\frac{C_2}{C_1} = \frac{1-\alpha}{\alpha}\,,\qquad f_{xor2} = g_{xor2} h_{xor2} = g_{xor2}\,\frac{H C_{in}}{C_2} = 2\,\frac{H}{1-\alpha}\,.\]

Now, the lower leg of the fork has delay

\[D_l = f_{xor2} + p_{xor2} = 2\, \frac{H}{1-\alpha} + 4\,,\]

and the upper leg has delay

\[D_u = (f_{inv} + p_{inv}) + (f_{xor2} + p_{xor2}) = \frac{1-\alpha}{\alpha} + 2\,\frac{H}{1-\alpha} + 5\,,\]

where \(D_u = (f_{inv} + p_{inv}) + D_l.\) We can minimize \(D_u\) by setting the derivative \(d D_u/ d \alpha\) to zero, and find

\[\alpha = \frac{\sqrt{2 H} - 1}{2 H - 1}\,.\]

Thus when using the XOR logic gate as part of a larger circuit, we can choose \(\alpha\) depending on electrical effort \(H\) to minimize its delay.

If we wish to minimize the delay of the XOR logic gate further, we could make the XOR CMOS gate asymmetric such that the logical effort of the complemented input \(\overline{A}\) is smaller than that of the uncomplemented input, so as to speed up the path through the inverter leg. Alternatively, we may recall our lesson that a 2-fork is preferable to a 1-fork. Hence, our second attempt for an XOR logic gate uses 2-forks to generate the complemented and uncomplemented inputs, as shown in Figure 4.22.

Figure 4.22: 2-input XOR gate with 2-fork inputs.

We can analyze this XOR circuit with our 2d-analysis method. The path of interest is shown in Figure 4.23, with effort delay \(d_1\) assigned to the inverters of the 2-fork, and effort delay \(d_2\) for the XOR CMOS gate.

Figure 4.23: The path of interest of the 2-input XOR gate in Figure 4.22 includes a 2-fork with equal loads.

The 2d-analysis enables us to express effort delay \(d_2\) as a function of \(d_1\):

\[d_2 = 2 H \Bigl(\frac{4}{d_1^2} + \frac{1}{d_1 + 1}\Bigr)\,,\]

such that the path effort delay becomes

\[D = d_1 + d_2 = d_1 + \frac{8 H}{d_1^2} + \frac{2 H}{d_1 + 1}\,.\]

Minimizing the path effort delay by setting the derivative \(\partial D / \partial d_1\) to zero yields the polynomial in \(d_1\):

\[d_1^5 + 2 d_1^4 + (1 - 2 H) d_1^3 - 16 H d_1^2 - 32 H d_1 - 16 H\,.\]

For a given electrical effort \(H,\) we determine delay \(d_1\) by finding the positive real root of the polynomial. Then, the minimum delay of the XOR logic gate is \(\hat{D} = d_1 + d_2 + P,\) where \(P = 2 p_{inv} + p_{xor2} = 6.\) Figure 4.28 below compares the delays of the two XOR gates with 1-fork and 2-fork inputs as a function of electrical effort \(H.\) Not unexpectedly, for all \(H\) the 2-fork design is faster than the 1-fork design. Furthermore the 1-fork design slows down much more rapidly than the 2-fork design for increasing \(H.\)

The speed difference between the 1-fork and 2-fork XOR gate designs motivates the search for even faster XOR gate circuits. In the following, we present three alternative designs. We begin with the reconverging NAND circuit in Figure 4.24 on the left. Using Boolean algebra, it is straightforward to verify that this circuit implements the 2-input XOR function. During our study of reconverging branches, we have identified the 2-fork style modification in Figure 4.24 on the right as superior because it is faster. The comparison of the delays in Figure 4.28 shows that the reconverging NAND circuit is slower than the 2-fork XOR gate circuit up to \(H = 47.\) For larger electrical efforts, the reconverging NAND circuit is faster.

xor gate from reconverging nand circuits

Figure 4.24: The reconverging branch circuit implements a 2-input XOR gate (left). The 2-fork style (right) yields a faster circuit.

Since NAND gates have not only a relatively small logical effort but also small parasitic delay compared to other 2-input gates, a 2-level NAND circuit is another top contender for an XOR gate. Again, we may use a 1-fork or the 2-fork shown in Figure 4.25 to generate the complemented and uncomplemented inputs. The 2-fork is faster than the 1-fork, as usual. The resulting XOR circuit has four stages on the 2-inverter leg, so that we expect this circuit to be better suited for larger electrical efforts.

xor gate from 2-fork 2-level nand circuit

Figure 4.25: A 2-level NAND circuit with 2-fork inputs implements a 2-input XOR gate.

To determine the minimum delay as a function of electrical effort \(H,\) we apply the 1d-analysis method with effort delay \(d\) assigned to the gates as shown in Figure 4.26. The 1d-analysis of the circuit yields the polynomial

\[d^5 + \frac{1}{2} d^4 - \frac{8}{9} H d^2 - \frac{16}{9} H d - \frac{8}{9} H\,.\]

Given electrical effort \(H,\) the positive real root of the polynomal equals effort delay \(d,\) for which the circuit has minimum path delay \(\hat{D} = 4 d + 6.\) Figure 4.28 plots \(\hat{D}\) over electrical effort \(H.\) We see that the 2-fork XOR gate is faster up to \(H = 14.\) For \(H \ge 15,\) the 4-stage NAND circuit is faster.

xor gate from 2-fork 2-level nand circuit analysis

Figure 4.26: Path of interest of 2-level NAND circuit with 2-fork inputs.

Compound gates enable us to implement two-level logic in a compact fashion. Thus, we may implement an XOR gate using OAI or AOI compound gates. Figure 4.27 shows on the left an XOR circuit design based on an OAI gate. This circuit uses a NAND gate that behaves like the inverter of a 1-fork driving the OAI gate. This circuit is indeed slower than the more complex version on the right of Figure 4.27 that uses a 2-fork structure at the inputs and a AOI gate.

2-input xor gate based on compound gates

Figure 4.27: 2-input XOR gate based on an OAI compound gate (left), and with 2-fork inputs and an AOI compound gate (right).

We analyze the AOI circuit using a 2d-analysis, by assigning effort delay \(d_1/2\) to the NAND gate and the inverter in the upper leg and effort delay \(d_1+2\) to the lower leg of the fork, and effort delay \(d_2\) to the AOI gate. Then, the 2d-analysis permits expressing \(d_1\) as a function of \(d_2.\) Minimizing the path effort delay by setting its derivative w.r.t. \(d_1\) to zero yields polynomial

\[d_1^5 + 4 d_1^4 + (4 - 2 H) d_1^3 - \frac{160}{9} H d_1^2 - \frac{640}{9} H d_1 - \frac{640}{9} H\,.\]

Given electrical effort \(H,\) the positive real roots of the polynomial yield minimum path delay \(\hat{D} = d_1 + d_2 + 16/3,\) which is plotted in Figure 4.28 as a function of \(H.\) We find that the AOI circuit has competitive delays xacross the range of \(H,\) but for no \(H\) is the AOI circuit the fastest XOR gate.

Figure 4.28: Minimum delays of XOR circuits over electrical effort.

The comparison in Figure 4.28 suggests that the fastest XOR implementation depends on electrical effort \(H.\) In particular, among the studied alternatives, the 2-fork XOR circuit of Figure 4.22 is fastest for \(H<12,\) whereas for \(H \ge 12,\) the two-level NAND circuit of Figure 4.25 is fastest.

XNOR Gates¶

The XNOR gate is very similar to the XOR gate. Logically, the XNOR gate produces the complement of the XOR gate. Thus, an \(n\)-input XNOR gate implements the even-parity operation, which outputs 1 if the number of inputs with value 1 is even. The truth table below shows the XNOR function for \(n = 3,\) \(Y = \overline{A \oplus B \oplus C},\) which equals 1 if the number of 1-inputs is 0 or 2.

A B C Y

0 0 0 1

0 0 1 0

0 1 0 0

0 1 1 1

1 0 0 0

1 0 1 1

1 1 0 1

1 1 1 0

A	B	C	Y
0	0	0	1
0	0	1	0
0	1	0	0
0	1	1	1
1	0	0	0
1	0	1	1
1	1	0	1
1	1	1	0

The CMOS implementation of a 2-input XNOR gate is shown in Figure 4.29. Like the 2-input XOR gate, it consists of four arms. The pull-up and pull-down networks are not duals of each other. Noteworthy is the fact that XOR and XNOR CMOS gates have the same topology. In particular, we do not need an output inverter to implement the complement, as required for the AND and OR gates.

A = 0 B = 0 Figure 4.29: CMOS circuit for XNOR gate and interactive switch model.

Since the topology of the XNOR gate equals that of the XOR gate, both logical effort and parasitic delay of the XNOR gate are equal to those of the XOR gate:

\[g_{xnor2} = 2\,,\qquad p_{xnor2} = 4\,.\]

XNOR gates with more than two inputs can be constructed analogous to the XOR gate. Also analogous to the XOR logic gate are the design issues of CMOS circuits for a 2-input XNOR logic gate.

4.1.6. Skewed Gates¶

When optimizing a circuit for speed, we may want the falling transition of a signal from high to low voltage to be faster than the rising transition from low to high voltage or vice versa. In this section, we discuss skewed gates whose critical transition is faster than the noncritical transition. In contrast, CMOS gates with equal rising and falling delays are unskewed or normal-skew gates. We distinguish between HI-skew gates and LO-skew gates. In a HI-skew gate the rising output transition is the faster, critical transition, and in LO-skew gates the falling output transition is critical.

Design and Analysis¶

We can skew a gate by transistor sizing. For example, consider the matched 2-input NAND gate in Figure 4.30(a). By definition, the transistors of the matched NAND gate are sized to provide equal drive currents for the rising and falling transitions. Furthermore, the drive currents are equal to those of the reference inverter. Since the delay of a gate depends on its drive current, the rising and falling delays are equal if the magnitudes of the drive currents of the corresponding transitions are equal. The key insight for designing a skewed gate is to shrink the transistors in the CMOS network that drives the uncritical transition. For example, if we wish to speed up the rising transition of the NAND gate, we shrink the nMOS transistors of the pull-down network. The effect is that the pMOS transistors deliver the same drive current as the matched NAND gate on the rising transition, but smaller nMOS transistors reduce the logical effort of the gate, resulting in a faster transition. For example, the HI-skew NAND gate in Figure 4.30(b) halves the widths of the nMOS transistors compared to the matched NAND gate to speed up the rising transition.

Figure 4.30: Skewing a 2-input NAND gate: (a) matched gate, (b) HI-skew gate, (c) down-scaled gate.

To characterize the delay of the HI-skew NAND gate, we determine its logical effort and parasitic delay. However, due to the unmatched transistor sizes, logical effort and parasitic delay differ for the rising and falling transitions. We use \(g_u\) and \(p_u\) to denote the logical effort and parasitic delay of the rising output transition driven by the pull-up network, and \(g_d\) and \(p_d\) for the falling output transition driven by the pull-down network. Then the delays of the rising and falling transitions of a skewed gate are:

\[\begin{eqnarray} d_u &=& g_u h + p_u\,, \\ d_d &=& g_d h + p_d\,, \\ \end{eqnarray}\]

where \(h\) is the electrical effort of the gate.

We consider the rising output transition first. The output of a 2-input NAND gate transitions from 0 to 1 if initially both inputs \(A\) and \(B\) are 1, and one of the inputs switches from 1 to 0. The corresponding pMOS transistor in the pull-up network drives the output. Which one of the two pMOS transistors switches does not matter in this example, because the NAND gate is symmetric. Since the width of the pMOS transistors of the HI-skew gate equals the width in the matched NAND gate, both gates produce the same drive current. Furthermore, because the matched NAND gate has the same pull-up drive current as the reference inverter, we conclude that the HI-skew gate has the same pull-up drive current as the reference inverter. Now, recall our definition of the logical effort of a CMOS gate as the ratio of its input capacitance to that of the reference inverter, assuming that the CMOS gate is sized to deliver the same drive current as the reference inverter. Since the drive currents are equal, the logical effort \(g_u\) of the rising transition of the HI-skew gate is the ratio of the input capacitance of the HI-skew gate and the input capacitance of the reference inverter. According to Figure 4.30(b), each input has input capacitance \(C_{in}(hi) = W_p + W_n = 3,\) because \(W_p = 2\) and \(W_n = 1,\) so that

\[g_u = \frac{C_{in}(hi)}{C_{in}(inv)} = \frac{3}{3} = 1\,.\]

Shrinking the pull-down transistors of the HI-skew gate retains the drive current through the pull-up network while reducing the input capacitance. As a result logical effort \(g_u = 1\) is less than the logical effort of the matched gate \(g_{nand2} = 4/3.\) Similarly, the parasitic delay of the rising transition of the HI-skew gate is

\[p_u = \frac{C_{out}(hi)}{C_{out}(inv)} = \frac{2 \cdot 2 + 1}{3} = \frac{5}{3}\,.\]

We find that parasitic delay \(p_u\) is less than that of the matched gate \(p_{nand2} = 2.\)

Next, we determine \(g_d\) and \(p_d\) of the falling transition of the HI-skew gate. To apply the definition of logical effort, we construct the down-scaled NAND gate shown in Figure 4.30(c). The down-scaled gate has the same nMOS transistor widths as the HI-skew gate, so that their drive currents are equal. We halve the width of the pull-up transistors, so that the down-scaled gate is a scaled version of the matched NAND gate with scale factor \(\gamma = 1/2.\) Now, we can argue that the HI-skew NAND gate has the same pull-down drive current as the matched NAND gate scaled by \(\gamma = 1/2,\) which in turn has the same pull-down drive current as a reference inverter scaled by \(\gamma = 1/2.\) Therefore, the logical effort of the falling transition of the HI-skew gate is the ratio of the input capacitance of the HI-skew gate and the input capacitance of the scaled reference inverter:

\[g_d = \frac{C_{in}(hi)}{\gamma C_{in}(inv)} = \frac{3}{\frac{1}{2} \cdot 3} = 2\]

for each input. The cost of the reduced logical effort \(g_u = 1\) of the rising transition is an increased logical effort \(g_d = 2\) of the falling transition, compared to \(g_{nand2} = 4/3\) of the normal-skew gate. Analogously, the parasitic delay of the falling transition of the HI-skew gate is

\[p_d = \frac{C_{out}(hi)}{\gamma C_{out}(inv)} = \frac{5}{\frac{1}{2} \cdot 3} = \frac{10}{3}\,.\]

The rising and falling logical efforts and parasitic delays are related through scale factor \(\gamma\) of the down-scaled matched gate: \(g_u = \gamma g_d\) and \(p_u = \gamma p_d.\) Notice that scale factor \(\gamma = 2\) is also the scale factor by which we shrank the pull-down nMOS transistors of the HI-skew gate to begin with. This observation simplifies the design and analysis of a skewed gate based the model of logical effort. Below is the procedure to design and analyze a HI-skew gate; the procedure for a LO-skew gate can be viewed as dual:

Choose a scale factor \(\gamma\) for the nMOS transistors of the pull-down network of the matched gate.
The logical effort \(g_u\) of the rising transition is the ratio of the input capacitance of the HI-skew gate and the reference inverter. The falling transition has logical effort \(g_d = g_u / \gamma.\)
The parasitic delay \(p_u\) of the rising transition is the ratio of the output capacitance of the HI-skew gate and the reference inverter. The falling transition has parasitic delay \(p_d = p_u / \gamma.\)

To facilitate a comparison between a HI-skew and normal-skew gates, we use the average of the rising and falling quantities:

\[g_{hi} = \frac{g_u + g_d}{2}\,,\qquad p_{hi} = \frac{p_u + p_d}{2}\,.\]

The average provides an accurate measure of the transition delay of a gate over time. When the gate is in operation as part of a larger circuit, half of all output transitions are rising and the other half are falling transitions. For the HI-skew NAND gate in Figure 4.30, the average logical effort is \(g_{hi} = 3/2\) and the average parasitic delay \(p_{hi} = 5/2.\) Thus, although our HI-skew NAND gate has a faster rising transition than the unskewed gate, both average logical effort and parasitic delay are larger than \(g_{nand2} = 4/3\) and \(p_{nand2} = 2\) of the unskewed gate.

Fast Skewed Gates¶

Our observation that the HI-skew NAND gate has a larger average delay than the unskewed NAND raises the question whether a skewed gate exists that is faster than its unskewed version. This is the case indeed. The LO-skew 2-input NOR gate in Figure 4.31(b) has a smaller average logical effort and smaller average parasitic delay than the normal-skew NOR gate.

Figure 4.31: Skewing a 2-input NOR gate: (a) matched gate, (b) LO-skew gate, (c) up-scaled gate.

The LO-skew NOR speeds up the falling transition by shrinking the pull-up pMOS transistors. In Figure 4.31, we scale the pMOS transistors of the matched NOR gate with factor \(\gamma = 3/4\) to obtain transistor width \(W_p = 3.\) Since the nMOS transistors are unchanged compared to the matched NOR gate, the drive current of the falling transition through the pull-down network of the LO-skew gate is the same as the pull-down drive current of the matched NOR gate and the reference inverter. Therefore, the logical effort of the falling transition is \(g_d = C_{in}(lo) / C_{in}(inv) = 4/3.\) The logical effort of the rising transition is then \(g_u = g_d / \gamma = 16/9.\) Note that \(g_u\) is the ratio of the input capacitances of the LO-skew gate and the \(\gamma\)-scaled reference inverter, and \(\gamma=3/4\) is equal to the scale factor of the up-scaled matched NOR gate in Figure 4.31(c). The average logical effort of the LO-skew NOR gate is \(g_{lo} = 14/9,\) which is less than the logical effort \(g_{nor2} = 5/3\) of the normal-skew gate. Similarly, we find the parasitic delays \(p_d = 5/3\) and \(p_u = 20/9.\) The average parasitic delay of \(p_{lo} = 35/18\) units is less than the parasitic delay \(p_{nor2} = 2\) of the normal-skew NOR gate.

We conclude that for all electrical efforts \(h,\) the average delay of the LO-skew NOR gate, \(d_{lo}(h) = (14/9) h + 35/18,\) is slightly less than the delay of the normal-skew NOR gate, \(d_{nor2}(h) = (5/3) h + 2.\) We can formulate a minimization problem to determine scale factor \(\gamma\) for the LO-skew NOR gate such that the average delay assumes its minimum. Using calculus, we find that scale factor

\[\gamma = \frac{1}{2} \sqrt{\frac{h+2}{h+1}}\]

minimizes the average delay if it is chosen as a function of electrical effort \(h.\) For \(h \ge 1,\) the range of \(\gamma\) is very small, \(1/2 < \gamma \le \sqrt{3/8}.\) Therefore, choosing \(\gamma \approx 1/2\) will approximate the minimum average delay reasonably well for all \(h.\) The resulting LO-skew NOR gate has pMOS widths \(W_p = \gamma 4 = 2.\) This gate is slightly faster than the matched NOR gate, because its average logical effort \(g=3/2\) is less than \(g_{nor2} = 5/3\) and the average parasitic delay \(p=2\) is the same. Before you decide to use LO-skew NOR gates in your circuits, however, notice that the speed benefit compared to the normal-skew NOR is relatively small. Furthermore, in many circuits, it is not the average delay but the worst-case delay that limits the overall performance. The worst-case delay of the LO-skew NOR is the delay of the noncritical rising transition, which is larger than the delay of the matched NOR gate.

Multistage Paths¶

The delay of an \(n\)-stage path with skewed gates depends on its output transition. Let index \(k,\) \(1 \le k \le n,\) denote the \(k^{th}\) gate from the last stage, then the path delays of the rising and falling output transition are:

\[\begin{eqnarray*} D_u &=& \sum_{\text{even}\ k} (g_{dk} h_k + p_{dk}) + \sum_{\text{odd}\ k} (g_{uk} h + p_{uk})\,, \\ D_d &=& \sum_{\text{even}\ k} (g_{uk} h_k + p_{uk}) + \sum_{\text{odd}\ k} (g_{dk} h + p_{dk})\,. \end{eqnarray*}\]

Since CMOS gates implement monotonically decreasing functions, every second stage of the path transitions in one direction, and all other stages transition in the opposite direction. Therefore, if the output of gate 1 in the last stage rises, the outputs of all gates with odd index \(k\) rise, and the outputs of the gates with even \(k\) fall. Path delay \(D_u\) is the sum of the corresponding skewed gate delays. Analogously, if the output of the last gate of the path falls, the path delay is \(D_d.\) The average path delay is a function of the average logical efforts and parasitic delays:

\[\begin{eqnarray*} D_{avg} &=& \frac{D_u + D_d}{2} \\ &=& \sum_{k=1}^n \Bigl(\frac{g_{uk} + g_{dk}}{2} h_k + \frac{p_{uk} + p_{dk}}{2}\Bigr) \\ &=& \sum_{k=1}^n (g_k h_k + p_k)\,. \end{eqnarray*}\]

If we wish to minimize the average path delay, we apply the method of logical effort for multistage paths as we know it already, except that we use the average logical efforts and parasitic delays of the skewed gates for path and gate sizing. The design procedure is the same as for unskewed gates. With skewed gates we have an additional design goal at our disposal, which is to minimize the delay of the critical transition.

Example 4.1: Path with Skewed Gates

Consider the 3-stage path in Figure 4.32 with a HI-skew NAND gate in stage 1, a LO-skew NOR gate in stage 2 and a normal-skew inverter in stage 3. Given the transistor widths in Figure 4.32 and load capacitance \(C_L = 6,\) determine the path delay of the critical rising transition, the path delay of the noncritical falling transition, and the average path delay.

Figure 4.32: A 3-stage path with a HI-skew NAND, a LO-skew NOR, and a normal-skew inverter.

Analysis of the gate logical efforts and parasitic delays yields the tabulated values.

	\(g_u\)	\(g_d\)	\(g_{avg}\)	\(p_u\)	\(p_d\)	\(p_{avg}\)
HI-skew NAND	1	2	3/2	5/3	10/3	5/2
LO-skew NOR	2	1	3/2	8/3	4/3	2
NOT	1	1	1	1	1	1

The average path delay from input \(A\) or \(B\) to output \(Y\) is the sum of the average gate delays:

\[D_{avg} = d_{avg,nand} + d_{avg,nor} + d_{avg,inv} = \Bigl(\frac{3}{2} \cdot \frac{3}{3} + \frac{5}{2}\Bigr) + \Bigl(\frac{3}{2} \cdot \frac{3}{3} + 2\Bigr) + \Bigl(1 \cdot \frac{6}{3} + 1\Bigr) = 10.5\,.\]

The minimum delay is the delay of the critical transition. The critical transition of the path is the rising output transition, which the path speeds up with the HI-skew NAND gate followed by the LO-skew NOR gate. The rising delay is

\[D_u = d_{u,nand} + d_{d,nor} + d_{u,inv} = \Bigl(1 \cdot \frac{3}{3} + \frac{5}{3}\Bigr) + \Bigl(1 \cdot \frac{3}{3} + \frac{4}{3}\Bigr) + \Bigl(1 \cdot \frac{6}{3} + 1\Bigr) = 8\,.\]

The maximum delay is the delay of the falling output transition

\[D_d = d_{d,nand} + d_{u,nor} + d_{d,inv} = \Bigl(2 \cdot \frac{3}{3} + \frac{10}{3}\Bigr) + \Bigl(2 \cdot \frac{3}{3} + \frac{8}{3}\Bigr) + \Bigl(1 \cdot \frac{6}{3} + 1\Bigr) = 13\,.\]

We find that the critical transition of the path is 2.5 delay units faster than the average delay, at the expense of the noncritical transition which is 2.5 delay units slower.

We close this section with a brief note on another practically relevant application of skewed gates, i.e. the design of gates with equal rising and falling delays if the ratio of the carrier mobilities of the CMOS fabrication process is not \(\mu_n/\mu_p = 2.\) Throughout this text, we assume that the mobility ratio is \(\mu_n/\mu_p = 2.\) Value 2 is convenient for back-of-the-envelope estimates, but is merely an approximation to reality, where typical values are in range \(2 \lesssim \mu_n/\mu_p \lesssim 3.\) In CMOS processes where \(\mu_n/\mu_p \ne 2,\) we may skew the gates to equalize the rise and fall times. The model of logical effort enables to determine the desired transistor widths, see Chapter 7 in [SSH99].

4.2. Comparators¶

Comparator circuits compare two \(n\)-bit signals \(A\) and \(B.\) The magnitude comparisons, including \(A > B\) or \(A \le B,\) are commonly implemented with an arithmetic circuit. However, for two special cases, equality comparison and equality to zero, simpler circuits exist.

4.2.1. Equality to Zero¶

Given an \(n\)-bit signal \(A,\) we wish to determine whether \(A = 0\) is true or false. Equality \(A = 0\) is true, if for each signal \(A_i,\) \(0 \le i < n,\) we have \(A_i = 0.\) Thus, we may use an \(n\)-input NOR gate to compute the equality to zero:

\[A = 0\quad \Leftrightarrow\quad \overline{A_0 + A_1 + \cdots + A_{n-1}} = 1\,.\]

Note that value \(0\) in equality \(A=0\) refers to an n-bit signal of 0’s, whereas the \(1\) in the NOR equality denotes a single bit, the output of the NOR gate.

For larger values of \(n,\) we may use a tree-structured NOR gate to minimize the delay. Figure 4.33 shows one possible tree-structure for \(n = 4,\) which is due to the Boolean identity:

\[\begin{eqnarray*} \overline{A_0 + A_1 + A_2 + A_3} &=& (\overline{A_0 + A_1}) \cdot (\overline{A_2 + A_3}) \\ &=& \overline{\overline{(\overline{A_0 + A_1}) \cdot (\overline{A_2 + A_3})}}\,. \end{eqnarray*}\]

As for any tree structure, the fastest design for an \(n\)-bit equality-to-zero comparator depends on the number of inputs \(n.\)

Figure 4.33: Equality-to-zero comparator for a 4-bit signal: (left) 4-input NOR gate, (right) one possible tree-structured NOR gate.

4.2.2. Equality Comparator¶

Given two \(n\)-bit signals \(A\) and \(B,\) we wish to determine whether \(A = B\) is true or false. The equality-to-zero comparison may be viewed as a special case of the equality comparison where \(B\) is identical to zero. Equality \(A = B\) is true if for each bit position \(i\) we have \(A_i = B_i.\) More precisely:

\[A = B\quad \Leftrightarrow\quad \forall i \in \{0, 1, \ldots, n-1\}: A_i = B_i\,.\]

Recall that the 2-input XNOR gate realizes the equality relation, \(A_i = B_i\ \Leftrightarrow\ \overline{A_i \oplus B_i} = 1.\) Thus, we may implement an equality comparator using one 2-input XNOR gate for each bit position, and an \(n\)-input AND gate to compute the all quanitfication by means of a conjunction:

\[Y = \bigwedge_{i=0}^{n-1} \overline{A_i \oplus B_i}\,.\]

Output \(Y = 1\) if \(A = B.\) For larger values of \(n,\) we implement the AND gate as a tree structure. For example, Figure 4.34 shows a 4-bit equality comparator with a NAND-NOR tree implementation of the 4-input AND gate on the right.

Figure 4.34: Equality comparator for 4-bit signals: (left) conjunction with 4-input AND gate, (right) one possible tree-structured AND gate.

4.3. Multiplexer¶

A multiplexer, or mux for short, is a circuit with \(n = 2^k\) data inputs \(D_i,\) \(0 \le i < n,\) and \(k\) select inputs \(S_j,\) \(0 \le j < k,\) that choose one of the \(n\) data inputs and steer it to output \(Y.\)

For example, a 2-input multiplexer, also called 2:1 mux to emphasize 2 inputs and 1 output, has \(n=2\) data inputs, \(D_0\) and \(D_1,\) and \(k=1\) select input \(S.\) Output \(Y\) is defined such that

\[\begin{split}Y = \begin{cases} D_0\,, & \text{if}\ S = 0\,, \\ D_1\,, & \text{if}\ S = 1\,. \end{cases}\end{split}\]

The circuit symbol of the 2:1 mux signifies that select signal \(S\) steers input \(D_0\) to output \(Y\) if \(S = 0,\) and otherwise input \(D_1.\) We can model the functional behavior of a 2:1 multiplexer using a switch with two closed positions. Note that this switch model differs from our standard model, where a switch is either open or closed.

To design a circuit for 2:1 mux, we first construct a truth table. The 2:1 mux has three inputs, \(S,\) \(D_1,\) and \(D_0,\) and one output \(Y.\) We transliterate the specification logic straight into the truth table: if \(S=0,\) then set output \(Y = D_0\) (top four rows), else if \(S=1\) set \(Y = D_1\) (bottom four rows):

S D1 D0 Y

0 0 0 0

0 0 1 1

0 1 0 0

0 1 1 1

1 0 0 0

1 0 1 0

1 1 0 1

1 1 1 1

S	D1	D0	Y
0	0	0	0
0	0	1	1
0	1	0	0
0	1	1	1
1	0	0	0
1	0	1	0
1	1	0	1
1	1	1	1

Output \(Y\) of this truth table is represented by the Boolean expression:

\[Y = \overline{S} D_0 + S D_1\,.\]

This expression suggests an implementation of the 2:1 mux using logic gates as shown on the right. The critical path of this circuit is a 3-stage path from input \(S\) through the inverter, one AND gate and the OR gate to output \(Y.\)

A faster implementation of the 2:1 mux can be based on a compound gate, or with the CMOS circuit shown in Figure 4.35. Since the circuit produces the complemented output \(Y = \overline{\overline{S} D_0 + S D_1}\) it implements an inverting 2:1 mux. Each data input drives one vertical select arm consisting of two nMOS and two pMOS transistors. The complemented and uncomplemented select inputs enable one of the select arms to drive output \(Y.\) If select input \(S=0,\) the left arm is enabled, and if \(S=1\) the right arm. The output value depends on the corresponding data input, \(D_0\) for the left arm and \(D_1\) for the right arm.

S = 0 D1 = 0 D0 = 0 Figure 4.35: CMOS circuit for inverting 2:1 mux and interactive switch model.

The inverting 2:1 multiplexer in Figure 4.35 has a remarkably flexible structure that extends to \(n\) data inputs, \(n > 0,\) even if \(n\) is not a power of 2. If we are willing to generate one select signal per data input, e.g. by means of a decoder, we can replicate the select arm \(n\) times to construct the \(n\)-way multiplexer in Figure 4.36. Select arm \(i\) drives the complement of \(D_i\) onto output \(Y,\) if select input \(S_i = 1.\) All other select signals \(S_j,\) \(j \ne i,\) must be \(S_j = 0.\) Otherwise, two enabled select arms might drive different values on output \(Y,\) effectively short circuiting \(V_{DD}\) and GND.

Figure 4.36: An \(n\)-way multiplexer with one select arm per data input.

Since each select arm consists of two series pMOS transistors in the pull-up network and two series nMOS transistors in the pull-down network, the matched \(n\)-way multiplexer has pMOS width \(W_p = 4\) and nMOS width \(W_n = 2.\) Therefore, the logical effort of each data input is \(g_{mux}(D_i) = 2,\) independent of the number of inputs. This property is unique feature of the multiplexer circuit. Unfortunately, the parasitic delay \(p_{mux}(n) = 2 n\) grows proportional to the number of inputs \(n.\) Nevertheless, we note that the effort delay of the multiplexer circuit is independent of its fan-in.

For larger numbers of inputs \(n,\) the parasitic delay of the \(n\)-way circuit in Figure 4.36 may dominate the total delay. In this case, tree-structured multiplexers provide a fast alternative. As an example, consider a 4:1 multiplexer with four data inputs and two select inputs, as shown on the left in Figure 4.37. The 4:1 multiplexer steers data input \(D_i\) to output \(Y\) if the select signal equals \(i\) in binary format. Note that \(k\) select inputs enable us to select one of \(n = 2^k\) data inputs, because a binary number with \(k\) bits can represent unsigned numbers in range \([0, 2^k-1].\) In particular, each of the \(k\) select signals may be used as select input for one of \(k\) levels in a binary tree of 2:1 muxes with \(n = 2^k\) data inputs. The multiplexer tree on the right in Figure 4.37 has \(4 = 2^k\) inputs and \(k=2\) levels.

Figure 4.37: A 4:1 multiplexer (left) built as a tree of 2:1 multiplexers (right).

The functionality of the 4:1 multiplexer is easy to verify by perfect induction. For example, if \(S = 01,\) i.e. \(S_1 = 0\) and \(S_0 = 1,\) then the level-1 output mux steers its 0-input to \(Y,\) because \(S_1 = 0.\) The 0-input of the level-1 mux is driven by the top level-0 mux, which steers \(D_1\) to its output, because \(S_0 = 1.\) Therefore output \(Y = D_1\) for \(S = 01.\) We argue analogously about the other cases of the perfect induction:

\(S_1\) \(S_0\) \(Y\)

0 0 \(D_0\)

0 1 \(D_1\)

1 0 \(D_2\)

1 1 \(D_3\)

\(S_1\)	\(S_0\)	\(Y\)
0	0	\(D_0\)
0	1	\(D_1\)
1	0	\(D_2\)
1	1	\(D_3\)

A delay analysis reveals that a \(4\)-way mux with independent select arms has a delay of \(D_{4way} = 2 H + 8\) per data input, whereas the 4:1 tree mux with two 2-way muxes on its critical path has a minimum delay of \(\hat{D}_{tree} = 4 \sqrt{H} + 8\) delay units per data input. Thus, the tree mux is faster than the 4-way circuit for electrical effort \(H > 4.\) This delay analysis neglects the fact that the 4-way mux circuit is inverting whereas the tree mux is not. We may also design tree-structured multiplexers with 4-way or 8-way muxes as building blocks. Due to the flexibility of the \(n\)-way mux circuit, the design space for tree-structured muxes is quite large. It has been shown, however, that 4-way muxes as building blocks produce generally the fastest tree structures for \(n \ge 8\) data inputs. Chapter 11 of [SSH99] studies tree structures in more detail.

4.4. Tristate Inverter and Transmission Gates¶

A single select arm of the multiplexer circuit in Figure 4.36 is a tristate inverter. The name tristate refers to the fact that the output can assume a third state in addition to the usual binary states 0 and 1. In the third state, output \(Y\) of the tristate inverter is connected neither to \(V_{DD}\) nor GND. Therefore, the output has an undetermined voltage. We say the output floats, and denote this third state with letter Z. Figure 4.38 shows that the output of the select arm floats if select signal \(S = 0.\) Otherwise, if \(S = 1,\) the arm functions like an inverter, i.e. \(Y = \overline{D}.\)

S = 0 D = 0 Figure 4.38: Interactive switch model of one multiplexer select arm.

In a multiplexer circuit, all but one select arm float. The selected arm does not float, and drives output \(Y.\) In a figurative sense, the floating arms shut up, leaving the word to the only selected arm. In an \(n\)-way multiplexer with \(n = 2^k\) arms, there is always one selected arm, and the output never floats. In general, this is the expected behavior from a multiplexer, that an implementation with logic gates produces as well. Tristate inverters expand the applicability of select arms, for example as drivers for shared buses where all arms may float.

To arrive at the traditional circuit realization of a tristate inverter, we play the following circuit trick. Note that we can introduce a new wire in the select arm, shown in Figure 4.39(b), without affecting its functionality. Now, imagine we could pull output \(Y\) to the right like a rubber band. Then, we obtain the topologically equivalent circuit in Figure 4.39(c).

transform select arm into tristate inverter

Figure 4.39: Three equivalent circuits: (a) select arm, (b) select arm with additional wire \(X,\) (c) inverter driving a transmission gate.

The parallel composition of an nMOS and pMOS transistor in Figure 4.39(c) is a transmission gate with the function:

\[\begin{split}Y = \begin{cases} X\,, & \text{if}\ S = 1\,, \\ Z\,, & \text{if}\ S = 0\,. \end{cases}\end{split}\]

The transmission gate disconnects its two terminals, if \(S=0,\) and connects its two terminals if \(S = 1.\) In Figure 4.39(c), terminal \(X\) is driven by an inverter, such that \(X = \overline{D}.\) If \(S = 0,\) output \(Y\) is disconnected from \(X,\) and floats. Otherwise, if \(S=1,\) output \(Y\) is connected to terminal \(X,\) and the inverter drives output \(Y = \overline{D}\) through the transmission gate.

This circuit is called tristate inverter, and is commonly defined as a 2-input gate, with a distinguished enable input. The symbol of the tristate inverter is shown on the left. It assumes that the enable signal is complemented internally. The function of the tristate inverter is defined as

\[\begin{split}Y = \begin{cases} A\,, & \text{if}\ EN = 1\,, \\ Z\,, & \text{if}\ EN = 0\,. \end{cases}\end{split}\]

To analyze the logical effort of the tristate inverter in Figure 4.39(c), we need to understand the pass characteristics of the transmission gate. To that end, we introduce a refinement of our simple transistor switch model. The key to understanding the transmission gate is the threshold voltage \(V_t,\) which is a characteristic process parameter of a transistor. For today’s nMOS transistors, \(V_{tn} = +V_t\) is in range \(V_{DD}/4 \lesssim V_t \lesssim V_{DD}/3\) and for pMOS transistors \(V_{tp} \approx -V_t\) has opposite polarity.

Figure 4.40: Refined switch model of MOS transistors for ON position. Due to a threshold voltage (\(V_t\)) drop, nMOS transistors pass a weak 1 and pMOS transistors a weak 0. However, since nMOS transistors pass a strong 0 they are suited for pull-down networks, and since pMOS transistors pass a strong 1 they are suited for pull-up networks of CMOS circuits.

An nMOS transistor is switched off if the voltage between gate and source is smaller than the threshold voltage, \(V_{gs} < V_t.\) This cut-off effect influences the current between source and drain, \(I_{ds},\) as well as the terminal voltages in a more differentiated fashion than our simple switch model suggests. Figure 4.40 shows two cases where the source terminal of the nMOS transistor is connected to GND (left), and where the drain terminal is connected to \(V_{DD}\) (right). We assume the gate voltage of the nMOS transistor equals \(V_g = V_{DD}.\) In our simple switch model \(V_g = V_{DD},\) i.e. g = 1 in the digital abstraction, closes the nMOS transistor. Now, consider the case where the source is tied to GND, \(V_s = 0,\) see Figure 4.40 on the left. The voltage between gate and source is \(V_{gs} = V_g - V_s = V_{DD},\) which is larger than threshold voltage \(V_t.\) The current \(I_{ds}\) flowing from drain to source depends on drain voltage \(V_d\) and on-resistance \(R_n\) of the transistor, such that \(I_{ds} = V_d / R_n\) according to our transistor RC model. In case where \(I_{ds} = 0,\) the transistor pulls the drain voltage to ground, i.e. \(V_d = V_{ds} = 0.\) This is the behavior of an nMOS transistor that we know from the pull-down network of the CMOS inverter, for example.

In Figure 4.40 on the right, both gate and drain terminals of the nMOS transistor are tied to \(V_{DD},\) so that \(V_g = V_d = V_{DD}.\) In this circuit, current \(I_{ds}\) depends on source voltage \(V_s.\) Since the transistor is switched on only if \(V_{gs} \ge V_t,\) the source voltage of the closed transistor cannot rise to \(V_{DD}\) without external force. Instead, \(V_{gs} \ge V_t\) implies \(V_{ds} \ge V_t\) by KVL, so that \(V_s \le V_{DD} - V_t.\) The source-drain voltage suffers the so-called threshold drop, which effectively reduces on-current \(I_{ds}.\) In the digital abstraction we say that the nMOS transistor pulls the source terminal to a weak 1 at the drain, because the maximum source voltage is by \(V_t\) smaller than drain voltage \(V_{DD}.\) In contrast, shown in Figure 4.40 on the left, the nMOS transistor pulls the drain terminal to a strong 0 at the source, because the minimum drain voltage equals source voltage 0. This behavior is the reason why we use nMOS transistors in pull-down but not in pull-up networks of CMOS circuits.

The analogous effect occurs in pMOS transistors, where all polarities are reversed. The threshold voltage of a pMOS transistor is negative, so that a pMOS transistor is closed if \(V_{gs} \le -V_t,\) and is open if \(0 \ge V_{gs} > -V_t.\) The pMOS transistor pulls the source terminal to a weak 0 at the drain, because the minimum source voltage is by \(V_t\) larger than \(V_d = 0,\) see Figure 4.40 on the left. However, a pMOS transistor pulls the drain terminal to a strong 1 at the source, because the maximum drain voltage is \(V_{DD}\) if \(V_{gs} = -V_{DD}\) which is smaller than \(-V_t,\) see Figure 4.40 on the right. Thus, pMOS transistors a suited for pull-up but not for pull-down networks of CMOS circuits.

Figure 4.41: The pass transistor model abstracts the refined switch model in Figure 4.40.

Figure 4.41 shows the pass transistor model as a convenient abstraction of the refined switch model. We say that an nMOS transistor passes a strong 0 from input (source) to output (drain), but a weak 1. Analogously, a pMOS transistor passes a strong 1 from input (drain) to output (source), but a weak 0. Since transistors are symmetric, source and drain can be used interchangeably as inputs or outputs. According to the pass transistor model we can use either an nMOS or a pMOS transistor as a switch to pass an input signal to the output. However, neither passes both inputs 0 and 1 strongly to the output. The transmission gate uses a parallel composition of nMOS and pMOS transistors, so that one of the transistors passes the input strongly, see Figure 4.42.

Figure 4.42: The transmission gate passes both a strong 0 via the nMOS and a strong 1 via the pMOS transistor.

When a transistor passes a strong 0 or 1, it acts like a closed switch in the RC model. Current \(I_{ds}\) is determined by the on-resistance. However, when passing a weak 0 or 1, current \(I_{ds}\) is effectively smaller than passing a strong 0 or 1. We model this behavior by means of an increased on-resistance. As a reasonable estimate, we assume that the on-resistance is twice as large when passing a weak versus a strong input. Then, given on-resistances \(R_n\) and \(R_p\) of the simple switch model, we define the weak and strong on-resistances as

\[\begin{eqnarray*} R_{n,strong}\ =\ R_n\,, &\quad & R_{n,weak}\ =\ 2 R_n\,, \\ R_{p,strong}\ =\ R_p\,, &\quad & R_{p,weak}\ =\ 2 R_p\,. \end{eqnarray*}\]

The on-resistances enable us to model the transmission gate as a resistive switch circuit, as shown in Figure 4.43.

Figure 4.43: Resistive switch model of transmission gate passing a 0 (left) and a 1 (right), assuming normalized widths \(W_p = W_n = 1.\)

Given unit-sized transistors, we have \(R_p = 2 R_n,\) because of mobility ratio \(\mu_n / \mu_p = 2.\) Then effective on-resistance for passing 0 is:

\[R_{on}(0) = \frac{2 R_p R_n}{2 R_p + R_n} = \frac{4}{5} R_n\,,\]

and for passing 1:

\[R_{on}(1) = \frac{R_p 2 R_n}{R_p + 2 R_n} = R_n\,.\]

Since the difference is small, we approximate the on-resistance of the transmission gate with unit-sized transistors to be \(R_n\) in both cases when passing 0 or 1. With this approximation, we arrive at the matched tristate inverter shown on the right. The equivalent tristate inverter circuit in Figure 4.39(c) consists of a series composition of an inverter and a transmission gate. By doubling the size of both, the tristate inverter has the same pull-up and pull-down drive currents as the reference inverter. Thus, the logical effort of the tristate inverter is \(g_{tri}(A) = 2\) for input \(A,\) and \(g_{tri}(EN) = g_{tri}(\overline{EN}) = 2/3\) for the complemented and uncomplemented enable inputs. The parasitic delay is \(p_{tri} = 4/3.\)

4.5. Encoder¶

In the broad sense, an encoder is a circuit that transforms its inputs into a code word of a given code. In the narrow sense of digital circuits, an encoder commonly denotes a circuit that transforms a one-hot coded word into a binary coded word. In this section, we discuss the one-hot to binary encoder and another useful circuit, the priority encoder.

4.5.1. One-Hot to Binary Encoder¶

The one-hot code with \(n\) bits \(x_i,\) where \(0 \le i < n,\) sets bit \(x_i = 1\) in code word \(X(i),\) and all other bits \(x_j = 0\) for \(j \ne i.\) For example, the one-hot code with \(n=4\) bits defines 4 code words \(X(i) = x_3 x_2 x_1 x_0\) for \(0 \le i < 4\):

\(i\) \(x_3\) \(x_2\) \(x_1\) \(x_0\)

0 0 0 0 1

1 0 0 1 0

2 0 1 0 0

3 1 0 0 0

\(i\)	\(x_3\)	\(x_2\)	\(x_1\)	\(x_0\)
0	0	0	0	1
1	0	0	1	0
2	0	1	0	0
3	1	0	0	0

The one-hot to binary encoder is a circuit with \(2^n\) inputs \(A_i,\) where \(0 \le i < 2^n,\) and \(n\) outputs \(Y_j,\) where \(0 \le j < n,\) such that output \(Y = k\) in binary format if input \(A_k = 1.\) For example, this truth table defines a 4:2 encoder for \(n=2\):

\(A_3\) \(A_2\) \(A_1\) \(A_0\) \(Y_1\) \(Y_0\)

0 0 0 1 0 0

0 0 1 0 0 1

0 1 0 0 1 0

1 0 0 0 1 1

\(A_3\)	\(A_2\)	\(A_1\)	\(A_0\)	\(Y_1\)	\(Y_0\)
0	0	0	1	0	0
0	0	1	0	0	1
0	1	0	0	1	0
1	0	0	0	1	1

Figure 4.44 shows two alternative circuits for the 4:2 encoder.

Figure 4.44: Two implementations of a 4:2 encoder based on OR-gates (left) and NOR-gate (right).

When building larger encoders, each of the \(n\) outputs requires an OR or NOR gate with \(2^{n-1}\) inputs. The load presented to the inputs of the encoder differs substantially, between driving just one and driving \(n\) output gates. For example, a 16:4 encoder based on OR gates has the output equations:

\[\begin{eqnarray*} Y_0 &=& A_1 + A_3 + A_5 + A_7 + A_9 + A_{11} + A_{13} + A_{15} \\ Y_1 &=& A_2 + A_3 + A_6 + A_7 + A_{10} + A_{11} + A_{14} + A_{15} \\ Y_2 &=& A_4 + A_5 + A_6 + A_7 + A_{12} + A_{13} + A_{14} + A_{15} \\ Y_3 &=& A_8 + A_9 + A_{10} + A_{11} + A_{12} + A_{13} + A_{14} + A_{15} \end{eqnarray*}\]

Input \(A_1\) drives only one OR gate of output \(Y_0.\) In contrast, input \(A_{15}\) is connected to all four OR gates. Therefore, the electrical effort of \(A_{15}\) is 4 times larger, and signal changes on input \(A_{15}\) suffer a larger delay than on input \(A_1.\) The method of logical effort enables us to assess whether we can speed up the slow inputs, for example by inserting buffers. However, rather than equalizing the delays of the encoder inputs, we prefer holding the driving circuit responsible for coping with the various load capacitances presented by the encoder inputs.

4.5.2. Priority Encoder¶

An \(n\)-bit priority encoder is a circuit with \(n\) inputs \(A_i\) and \(n\) outputs \(Y_i,\) where \(0 \le i < n,\) such that

\[\begin{split}Y_i = \begin{cases} 1\,, & \mbox{if}\ A_0 = \ldots = A_{i-1} = 0\ \mbox{and}\ A_i = 1\,, \\ 0\,, & \mbox{otherwise}. \end{cases}\end{split}\]

Informally, a priority encoder identifies the first in the sequence of input bits with value 1. For example, here is the truth table of a 3-bit priority encoder:

\(A_0\) \(A_1\) \(A_2\) \(Y_0\) \(Y_1\) \(Y_2\)

0 0 0 0 0 0

0 0 1 0 0 1

0 1 0 0 1 0

0 1 1 0 1 0

1 0 0 1 0 0

1 0 1 1 0 0

1 1 0 1 0 0

1 1 1 1 0 0

\(A_0\)	\(A_1\)	\(A_2\)	\(Y_0\)	\(Y_1\)	\(Y_2\)
0	0	0	0	0	0
0	0	1	0	0	1
0	1	0	0	1	0
0	1	1	0	1	0
1	0	0	1	0	0
1	0	1	1	0	0
1	1	0	1	0	0
1	1	1	1	0	0

The inputs \(A_i\) are ordered according to their index \(i.\) Output \(Y_i = 1,\) if \(A_0 = 0\) and \(A_1 = 0\) up to \(A_{i-1} = 0\) and input \(A_i = 1.\) The values of the remaining inputs \(A_{i+1}, \ldots, A_{n-1}\) does not matter for output \(Y_i.\) The circuit on the left of Figure 4.45 implements this logic for output \(Y_i\) with an \((i+1)\)-input AND gate, except for \(Y_0.\)

Figure 4.45: Two implementations of a 5-bit priority encoder with one AND gate per output (left) and a daisy chain (right).

The daisy chain circuit on the right of Figure 4.45 uses less area than the AND gate circuit on the left, at the expense of a larger propagation delay. As a rough estimate of the area requirements, let us count inverters and 2-input gates to assess the area requirements of the two circuits. Given 2-input gates only, we implement all larger AND gates of the circuit on the left as tree gates. Furthermore, assume that we make the complemented and uncomplemented inputs available with 2-forks, resulting in three inverters per input, for a total of \(3 n\) inverters for \(n\) 2-forks. A tree-structured AND-gate with \(k\) inputs requires approximately \(k/2 + (k/2-1) \approx k\) 2-input NAND gates and the same number of inverters. Thus, all of the AND gates of an \(n\)-bit priority encoder together require roughly

\[\sum_{k=1}^{n-1} k = \frac{1}{2} n (n-1)\]

2-input NAND gates and inverters, respectively. Therefore, the total number of gates of the priority inverter on the left in Figure 4.45 is approximately \(n^2/2\) 2-input NAND gates and \(n^2/2 + 3 n\) inverters. Asymptotically, for large \(n,\) the number of gates grows quadratically, a scary fact. But, the circuit can be quite fast, because the critical path is proportional to the height of the largest AND gate of output \(Y_{n-1},\) which consists of about \(\lg n/2\) 2-input AND gates only.

The daisy chain circuit on the right of Figure 4.45 uses only a linear number of gates. Each inner stage of the chain consists of two NAND gates and three inverters. The first stage needs only on inverter and the last stage one NAND gate and one inverter. Therefore, the total number of gates of an \(n\)-bit priority encoder with daisy chain structure is \(2 (n-2) + 1 = 2 n - 3\) 2-input NAND gates and \(3 (n-2) + 1 = 3 n - 5\) inverters. For large \(n,\) the daisy chain occupies significantly less area than the AND gate design. The price we pay for the area efficiency is the propagation delay of the critical path, which stretches from input \(A_0\) to output \(Y_{n-1}.\) There are \(n-1\) NAND gates and \(n\) inverters on this path. A path with \((2n-1)\) stages requires a path effort of about \(3.59^{2n-1}\) to be optimally sized. For large \(n,\) this path effort would be humongous. In all likelihood, practical designs will have to settle on a suboptimal delay that is dominated by the gates on the critical path.

If we wish to design a wide priority encoder with a large number of inputs and outputs \(n,\) then the AND gate design is not an option because of its quadratic area requirements, and the daisy chain would be slow. Thus, we may want to reduce the number of gates on the chain. We can halve the number of gates by using the inverting chain in Figure 4.46. The NAND gate in odd stage \(2i-1\) of the chain computes the complement \(\overline{\overline{a_0} \cdot \overline{a_1} \cdot \ldots \cdot \overline{a}_{2i-1}}\) and the NOR gate in even stage \(2i\) of the chain computes the uncomplemented conjunction \(\overline{a_0}\cdot \overline{a_1} \cdot \ldots \cdot \overline{a}_{2i}\) of the complemented input signals.

Figure 4.46: A 7-bit priority encoder with inverting daisy chain.

Example 4.2: Priority Encoder

We wish to size the gates of the 7-bit priority encoder in Figure 4.46 in order to minimize its delay. We proceed in two steps. First, we obtain a rough estimate for the delay of the chain assuming fixed gate sizes, and, second, we size the gates on the daisy chain according to the method of logical effort.

To get a feel for the delay of the daisy chain assume the load capacitance of each output is \(C_L(Y_i) = 36.\) Furthermore, all gates shall be scaled by factor \(\gamma = 2.\) Thus, each inverter has input capacitance \(C_{inv} = 6,\) each NAND gate input presents capacitance \(C_{nand} = 8\) and each NOR gate input \(C_{nor} = 10.\) Our path of interest is the critical path from input \(A_0\) to output \(Y_6.\) The logical effort of the path with 1 inverter, 3 NAND gates, and 3 NOR gates is \(G = (4/3)^3 (5/3)^3 = 10.97.\) The branching effort of all stages except the last is \(b = 2,\) for a path branching effort of \(B = 2^5.\) The electrical effort is \(H = C_L/C_{in} = 36/6 = 6.\) Thus, the path effort of the critical path is \(F = GBH = 2{,}107.\) According to the path sizing calculator, the best number of stages of an inverter chain for this stage effort is \(n^\star = 6.\) Thus, the number of stages on the critical path of our daisy chain, \(n=7,\) is by no means unreasonable. The delay computation of our priority encoder design is detailed in Table 4.1.

Table 4.1: Delay estimates for priority encoder.
\(\mbox{stage}\)	\(C_{in}\)	\(C_{out}\)	\(h\)	\(g\)	\(f\)	\(p\)
1	6	16	8/3	1	2.67	1
2	8	20	5/2	4/3	3.33	2
3	10	16	8/5	5/3	2.67	2
4	8	20	5/2	4/3	3.33	2
5	10	16	8/5	5/3	2.67	2
6	8	10	5/4	4/3	1.67	2
7	10	36	3.6	5/3	6	2
total					22.33	13

Table 4.1 suggests obvious improvements to our circuit. To obtain minimum delay, all stages should bear the same effort delay. Stages 6 and 7 present outliers with \(f_7=6\) and \(f_6 = 1.67\) deviating the most from the average \(22.33/7=3.2.\) We can fix the imbalances between the stage efforts by gate sizing. To minimize the delay of the chain, we should spread path effort \(F = 2{,}107\) across the \(n=7\) stages of the chain. Thus, with an effort delay of \(\hat{f} = 2{,}107^{1/7} = 2.98\) per stage, the total effort delay would be \(n \hat{f} = 20.9\) time units. The resulting improvement compared to our initial design is less than \(7\,\%.\) However, we may even do better if we also resize the output gates.

The lesson we have learned from branching circuits is that the delay on the path of interest can be reduced either by decreasing the off-path capacitance or by increasing the on-path capacitance or both. The method of logical effort also teaches us that each stage of a fast design should bear effort \(f^\star = 3.59.\) If the NOR gates that drive off-path outputs \(Y_2\) and \(Y_4\) bear best stage effort \(f^\star = g h = (5/3) (36/C_{in}(nor)),\) they should have input capacitance \(C_{in}(nor) = 16.7.\) This is almost twice of our initial assumption, and increases the branching effort compared to our initial design. We lower the branching effort of the path of interest by reducing the NOR gate sizes of outputs \(Y_2\) and \(Y_4\) from the original value of \(C_{in}(nor) = 10\) slightly to \(C_{in}(nor) = 8.\) The off-path AND gates, i.e. the NAND-INV pairs, that drive outputs \(Y_1,\) \(Y_3,\) and \(Y_5\) have two stages each. Thus, if both NAND gate and inverter bear best stage effort \(f^\star,\) then the NAND gates can have input capacitance \(C_{in}(nand) = 3.72.\) If we use minimum sized NAND gates with \(C_{in}(nand) = 4,\) we can reduce the branching effort of the path. As a result, we obtain the critical path of the priority encoder shown below, where the input capacitances \(C_2, \ldots, C_7\) of the NAND and NOR gates are yet to be determined:

critical path of priority encoder daisy chain

In general, to obtain the minimum delay for such a problem, we need to iterate the gate sizing. For the first iteration, we proceed as follows. We assume that the branching efforts in each stage is \(b = 2.\) Then, we know path effort \(F=2{,}107\) already, and the stage effort for minimum delay should be \(\hat{f} = 2.98,\) which is reasonably close to \(f^\star = 3.59,\) because the number of stages is close to the optimum. Working from the output towards the input, we can compute the input capacitances of each gate based on the relation for \(\hat{f} = g_i h_i = g_i C_{out}/C_{in}\):

\(i\)	\(1\)	\(2\)	\(3\)	\(4\)	\(5\)	\(6\)	\(7\)
\(C_i\)	3.45	6.29	6.06	6.83	7.27	9.0	20.1
\(b_i\)	1.64	2.32	1.58	2.1	1.44	1	1

Given the gate input capacitances, we can compute the branching efforts, and find that our assumption of branching effort \(b_i = 2\) is almost fulfilled, but only on average. The branching effort of our sized path is \(B' = 18.24\) rather than 64. Thus, given \(B' = 18.24\) the path effort is \(F' = (4/3)^3 (5/3)^3 B' 36/6 = 1{,}201\) only, and the stage effort for minimum path delay should be \(\hat{f} = F'^{1/7} = 2.75\) rather than assumed value \(\hat{f} = 2.98.\) Nevertheless, our design is pretty close to the minimum delay without further iterations already. Table 4.2 summarizes the resulting delays.

Table 4.2: Delay estimates for optimized priority encoder.
\(\mbox{stage}\)	\(C_{in}\)	\(C_{out}\)	\(h\)	\(g\)	\(f\)	\(p\)
1	6	10.29	1.71	1	1.71	1
2	6.29	14.06	2.23	4/3	2.98	2
3	6.06	10.83	1.79	5/3	2.98	2
4	6.83	15.28	2.23	4/3	2.98	2
5	7.27	13.01	1.79	5/3	2.98	2
6	9.01	20.13	2.23	4/3	2.98	2
7	20.13	36	1.79	5/3	2.98	2
total					19.6	13

We find that gate sizing reduces the delay of the original priority encoder from \(35.3\) to \(32.6\) time units. This result confirms that our initial attempt was pretty fast already.

4.6. Decoder¶

A decoder computes the inverse transform of an encoder. In the realm of digital circuits, the term decoder commonly refers to a circuit that transforms a binary code into a one-hot code.

A binary to one-hot decoder is a circuit with \(n\) inputs \(A_i,\) where \(0 \le i < n,\) and \(2^n\) outputs \(Y_j,\) where \(0 \le j < 2^n,\) that asserts output \(Y_k\) if the binary coded value of the input equals \(k.\) For example, the truth table below specifies a 2:4 decoder with \(n=2\) inputs and \(2^n = 4\) outputs, such that output \(Y_k = 1\) if input \(A = k\) for \(0 \le k < 4.\)

\(A_1\) \(A_0\) \(Y_3\) \(Y_2\) \(Y_1\) \(Y_0\)

0 0 0 0 0 1

0 1 0 0 1 0

1 0 0 1 0 0

1 1 1 0 0 0

\(A_1\)	\(A_0\)	\(Y_3\)	\(Y_2\)	\(Y_1\)	\(Y_0\)
0	0	0	0	0	1
0	1	0	0	1	0
1	0	0	1	0	0
1	1	1	0	0	0

Figure 4.47: Implementation of a 2:4 decoder.

Figure 4.47 shows the implementation of the 2:4 decoder. Since each output represents one minterm, we use one 2-input AND gate per output. In general, for a \(n{:}2^n\) decoder, the circuit requires \(2^n\) AND gates with \(n\) inputs each. Each complemented and uncomplemented input drives \(2^{n-1}\) AND gates.

The primary use of decoders is as address decoders for memories. For example, a register file with 8 registers employs a 3:8 decoder to select 1-out-of-8 registers. The address input is 3 bits wide, with a binary code to cover address range \([0, \ldots, 7].\) The 3:8 decoder transforms the binary address into a one-hot code that asserts exactly 1 of its 8 outputs associated with the selected register. A gate-level implementation of a 3:8 decoder is shown in Figure 4.48 on the left. We use 2-forks to provide each input in complemented and uncomplemented polarity. Since each input drives 4 NAND gates, the loads of the legs are equal. Assuming also that the loads of the outputs \(Y\) are equal, the decoder circuit is symmetric. Thus, we may minimize the delay of the decoder by analyzing the path of interest, actually the fork of interest, shown on the right-hand side of Figure 4.48. Each input has capacitance \(C_{in} = C_1 + C_2.\) The load capacitance at each output shall be \(C_L = H C_{in}.\) Each leg of the fork drives 4 NAND gates, one of which is on the path of interest, whereas the others branch off the path of interest.

Figure 4.48: Gate-level implementation of 3:8 decoder (left) and path of interest for delay analysis (right).

We use the 1d-analysis method to minimize the delay of the circuit. The delays allocated to the individual gates are included in Figure 4.48. For the upper leg of the fork, we multiply the effort delays to obtain:

\[\begin{eqnarray*} d^2 (d+1) &=& G B \frac{H C_{in}}{C_1} \\ &=& g_{nand3} \cdot \frac{4\,C_3}{C_3} \cdot \frac{H C_{in}}{C_1} \\ &=& \frac{20}{3}\, \frac{H C_{in}}{C_1}\,, \end{eqnarray*}\]

because \(g_{nand3} = 5/3.\) For the lower leg, the partial products of the effort delays yield:

\[\begin{eqnarray*} \Bigl(\frac{d}{2}\Bigr)^2 &=& \frac{4 C_3}{C_2} \\ d^2 &=& g_{nand3} \cdot \frac{H C_{in}}{C_3} \\ \Rightarrow\qquad \frac{d^4}{4} &=& \frac{20}{3}\, \frac{H C_{in}}{C_2}\,. \end{eqnarray*}\]

Since \(C_{in} = C_1 + C_2,\) we obtain the polynomial in \(d\):

\[d^5 + d^4 - \frac{20}{3} H d^2 - \frac{80}{3} H d - \frac{80}{3} H\,.\]

Given electrical effort \(H,\) we can determine effort delay \(d(H)\) as the larger of two positive real roots of the polynomial. The minimum delay of the 3:8 decoder is then \(\hat{D} = 3 d(H) + P,\) where \(P = 6.\) Figure 4.49 plots the minimum delay for \(1 \le H \le 100.\)

Figure 4.49: Minimum delay \(\hat{D}\) of 3:8 decoder as a function of electrical effort \(H.\)

Our 3:8 decoder is essentially a 3-stage design, which is best suited for path effort \(F = 3.59^3 = 46.\) For larger decoders or circuits with larger electrical effort \(H,\) we may increase the number of stages either by appending inverters to the outputs or by prepending inverters to the inputs. Furthermore, for large \(n\) we may implement the \((n/2)\)-input NAND gates using a tree structure.

4.7. Memory Elements¶

All of the logical circuits we have discussed so far are acyclic, that is they do not form cycles or loops except implicitly through the power supply. Acyclic circuits have inputs \(A\) and outputs \(Y,\) and implement logic functions of the inputs, \(Y = f(A).\) When the inputs change, the outputs assume the values defined by \(f\) after a circuit specific propagation delay. Cyclic circuits exhibit a more complex, time dependent behavior than their acyclic cousins, because their outputs depend on the input sequence including current and past input values. Cyclic circuits are of particular interest for the implementation of memory elements. In this section, we discuss two of the most commonly used types of memory elements, the D-latch and the D-flipflop.

4.7.1. Cyclic Circuits¶

A cyclic circuit has one or more loops, not counting implicit power supply loops. For example, Figure 4.50 shows two cyclic circuits with one loop each.

Figure 4.50: Cyclic circuits with one loop of two (left) and three (right) inverters.

Cyclic circuits are more difficult to analyze than acyclic circuits. In particular, the circuits shown in Figure 4.50 do not even have distinguished input and output terminals. The inverter pair on the left of Figure 4.50 has two wires: \(Q_0\) connects the output of inverter 0 to the input of inverter 1, and \(Q_1\) connects the output of inverter 1 to the input of inverter 0. Analogously, the three-inverter cycle on the right has three wires, \(Q_0,\) \(Q_1,\) and \(Q_2.\)

Let us analyze the functionality of the two-inverter loop. The output of each inverter can be 0 or 1. First, assume that \(Q_0 = 0.\) Thus, the input of inverter 1 is 0, hence \(Q_1 = 1.\) Since \(Q_1\) is the input of inverter 0, the output of inverter 0 must be \(Q_0 = 0.\) This is the same value that we started with. We conclude, that the circuit reinforces \(Q_0 = 0\) and, by symmetry, \(Q_1 = 1.\) Second, assume that \(Q_0 = 1.\) Then the output of inverter 1 is \(Q_1 = 0,\) which causes the output of inverter 0 to be \(Q_0 = 1.\) This is also the same value with started out with. We observe that the two-inverter loop reinforces its state in either case \(Q_0 = 0\) or \(Q_0 = 1.\)

The three-inverter loop behaves very differently than the two-inverter loop. Assume that \(Q_0 = 0.\) Since \(Q_0\) is the input of inverter 1, the output of inverter 1 must be \(Q_1 = 1.\) Since \(Q_1\) is the input of inverter 2, its output must be \(Q_2 = 0.\) Now, input \(Q_2\) is the input of inverter 0. Therefore, the output of inverter 0 must be \(Q_0 = 1,\) which is the complement of the assumption we started with. If we traverse the loop for a second time, we find \(Q_0 = 0,\) that is the wire is complemented again. We observe that the circuit does not reinforce its state. Instead, the values of the wires toggle at a speed determined by the propagation delay of the inverters.

We conclude that the two-inverter loop is stable whereas the three-inverter loop is not. In fact, these behaviors can be observed in larger loops as well. Every inverter loop with an even number of inverters is stable. In contrast inverter loops with odd numbers of inverters as unstable. Odd-numbered inverter loops are called ring oscillators, and are useful to measure the average propagation delay of an inverter in a given manufacturing process. Ring oscillators are not used as logic elements in digital circuits. In contrast, loops with even numbers of inverters are useful building blocks, because they are bistable. A bistable circuit is a circuit with two stable states. A state is stable, if the circuit does not transition into another state without external stimulus. The two stable states of the bistable two-inverter loop are \(S_0 = \{Q_0=0, Q_1=1\}\) and \(S_1 = \{Q_0 = 1, Q_1 = 0\}.\) When the circuit assumes a stable state, the inverters enforce that \(Q_0\) and \(Q_1\) are complements of each other, \(Q_0 = \overline{Q_1}.\) Thus, in Figure 4.51, we call the wires simply \(Q\) and \(\overline{Q}.\)

Figure 4.51: The two states of a bistable inverter loop.

The problem with the bistable inverter loop in Figure 4.51 is that we cannot control its state. Since the circuit is stable and has no inputs, it is not obvious how to transition the circuit from one stable state into the other. The D-latch, that we discuss below, fixes this deficiency.

4.7.2. D-Latch¶

A D-latch is a bistable memory element with data input \(D,\) clock input \(\phi,\) and output \(Q.\) The D-latch symbol is shown on the right. Figure 4.52 shows a D-latch implementation with a bistable inverter loop and a 2:1 multiplexer to steer the data input signal into the loop.

Figure 4.52: A D-latch implemented as a bistable inverter loop with an input multiplexer.

Clock input \(\phi\) of the D-latch serves as select input of the multiplexer, and determines whether data input \(D\) asserts the state of the bistable inverter loop. Hence, we distinguish two modes of operation:

\(\phi = 1\): D-latch is transparent.

The inverter loop is open, and input \(D\) drives the inverter pair. Since output \(Q\) follows input \(D,\) we say the D-latch is transparent.

\(\phi = 0\): D-latch is opaque.

The inverter loop is closed, and stores the current state. Since output \(Q\) retains its value, independent of input \(D,\) we say the D-latch is opaque.

Figure 4.53 shows the D-latch circuit with the multiplexer replaced by its switch model. The mode of operation depends on the position of the multiplexer switch. If \(\phi = 1,\) the loop is open, and input \(D\) drives both output \(Q\) and the inverter pair. Otherwise, if \(\phi = 0,\) the closed loop is disconnected from input \(D,\) and retains value \(Q\) because the loop is a bistable cyclic circuit.

Figure 4.53: Modes of operation of the D-latch. The D-latch is either (left) transparent: output \(Q\) follows input \(D,\) or (right) opaque: the inverter loop stores value \(Q.\)

The waveform diagram in Figure 4.54 illustrates the operation of a D-latch over time. The diagram shows the voltage levels of the clock and data inputs and the D-latch output. Initially, the D-latch is opaque and stores output value \(Q = 0.\) Input \(D\) transitions to 1 before the clock signal. When the D-latch becomes transparent, the output follows input \(D\) after a propagation delay, indicated by the curved arrow from \(\phi\) to \(Q.\) The D-latch remains transparent as long as the clock signal is \(\phi = 1.\) Output \(Q\) follows input \(D\) after a propagation delay, indicated by the curved arrows from the transitions of \(D\) to \(Q.\) When the clock transitions to \(\phi = 0,\) the D-latch stores the last value of input \(D\) before the clock transition, and holds this value for as long as the D-latch remains opaque. The D-latch is called level-sensitive because its two modes of operation depend on the level of the clock signal.

Figure 4.54: Waveform diagram of D-latch. When the D-latch is transparent, output \(Q\) follows input \(D.\) Otherwise, during the gray shaded time intervals, the D-latch is opaque, i.e. holds output \(Q\) unchanged and blocks \(D\) from propagating to the output.

In the following, we derive a CMOS circuit for the D-latch in Figure 4.52. We begin with the implementation of the 2:1 multiplexer with two select arms, integrating the \(\overline{\phi}\)-arm into the inverter loop as shown in Figure 4.55. Since this multiplexer circuit inverts the output, we add an inverter to generate uncomplemented output \(Q.\) Figure 4.55 shows the multiplexer select arms as tristate inverters in form of an inverter that drives a transmission gate.

Figure 4.55: D-latch implementation with multiplexer select arms shown as tristate inverters.

The functionality of the D-latch circuit depends essentially on the state of internal node \(X.\) When the \(\phi\)-arm is closed, it steers input \(D\) to \(X.\) Since the inverting mux arm changes the polarity, we have \(X = \overline{D}.\) The output inverter drives \(X\) to output \(Q,\) so that \(Q = \overline{X} = D\) when clock input \(\phi = 1,\) and the D-latch is transparent. Otherwise, when \(\phi = 0,\) the \(\phi\)-arm is open and the \(\overline{\phi}\)-arm is closed, then node \(X\) is disconnected from input \(D\) and the inverter loop reinforces \(X = \overline{Q}.\) The D-latch is opaque and output \(Q\) retains its value.

A compact CMOS implementation of the mux arms is shown in Figure 4.56. We exploit the circuit equivalence in Figure 4.39 to implement each mux arm with four transistors.

Figure 4.56: D-latch implementation with exploded multiplexer select arms.

Figure 4.57 shows the complete 12-transistor CMOS circuit for the D-latch, including an interactive switch model. Note that the output of the feedback inverter is \(Y = \overline{X},\) and the circuit also enforces \(X = \overline{Q}\) and \(Y = Q.\)

ϕ = 0 D = 0 Figure 4.57: CMOS circuit for D-latch and interactive switch model.

The timing behavior of the D-latch depends on the mode of operation. When the D-latch is transparent, the critical path stretches from data input \(D\) to output \(Q.\) The signal propagates through internal node \(X,\) and bypasses the feedback loop entirely. The delay is independent of clock signal \(\phi.\) When the D-latch is opaque, the output remains unchanged. Hence, it does not make sense to speak of a propagation delay. Therefore, we discuss the propagation delay of the transparent D-latch. In particular, we analyze the critical path of the D-latch with the matched transistor sizes shown in Figure 4.58.

Figure 4.58: D-latch circuit with matched transistor sizes.

The propagation delay of the transparent D-latch is the delay of the 2:1 multiplexer plus the delay of the output inverter: \(d_{D\rightarrow Q} = d_{mux} + d_{inv}.\) Assuming that output \(Q\) drives capacitive load \(C_L,\) the output inverter has electrical effort \(h_{inv} = C_L/C_{inv}\) and parasitic delay \(p_{inv}=1,\) such that \(d_{inv} = C_L/3 + 1\) time units. Next, we analyze the multiplexer. The logical effort is the input capacitance \(C_{mux}(D)\) of input \(D\) divided by the input capacitance of a reference inverter \(C_{inv}=3.\) Input \(D\) drives the pMOS transistor of width 4 and the nMOS transistor of width 2 of the \(\phi\)-arm. Thus, \(C_{mux}(D) = 6,\) and the logical effort of input \(D\) of the mux is \(g_{mux}(D) = 6/3 = 2.\) To determine the electrical effort of the mux, notice that output \(X\) of the mux drives two inverters, the feedback inverter and the output inverter. Thus, the load capacitance of the mux is \(C_L(mux) = 2 C_{inv} = 6.\) Therefore, the electrical effort of input \(D\) of the mux is \(h_{mux}(D) = C_L(mux)/C_{mux}(D) = 6/6 = 1.\) The parasitic delay of the 2:1 mux is \(p_{mux} = 4,\) see multiplexer. We find that the delay of the multiplexer is \(d_{mux} = g_{mux}(D) h_{mux}(D) + p_{mux} = 2 \cdot 1 + 4 = 6,\) and the propagation delay of the D-latch amounts to

\[d_{D\rightarrow Q} = d_{mux} + d_{inv} = 6 + \frac{C_L}{3} + 1 = \frac{C_L}{3} + 7\,.\]

This delay may serve as a point of reference for circuit optimizations. For example, we observe that the feedback inverter diverts current from the multiplexer output. If we shrink the pMOS transistor of the feedback inverter from 2 to 1 units of normalized width, we reduce the capacitance of the off-path branch and the delay of the path of interest.

Propagation delay \(d_{D\rightarrow Q}\) is only one of several characteristic quantities of the timing behavior of the D-latch. Arguably even more important is the timing behavior of the D-latch when the clock input changes. More succinctly, if both the clock and data inputs change at about the same time, the D-latch may become unstable. Digital circuits with D-latches should avoid this scenario by all means because it can impact the functionality adversely. Functional bugs caused by careless timing behavior are particularly difficult to uncover. We discuss the timing behavior of the D-latch in Figure 4.59, where data input \(D\) transitions at time \(t = 0\) and clock input \(\phi\) at time \(t = 14.\)

Figure 4.59: D-latch circuit and timing analysis. The clock input transitions from 1 to 0 at time \(t = 14.\) The D-latch is transparent for \(t < 14\) and opaque for \(t > 14.\) The opaque D-latch stores input value \(D=1.\)

Given the D-latch transistor sizes in Figure 4.58 and a load capacitance of \(C_L = 12,\) the delays of the circuit elements of the D-latch are:

\(d_{D\rightarrow X}\): \(\quad \phi\)-arm delay; \(d_{D\rightarrow X} = d_{mux} = 6\)

\(d_{X\rightarrow Y}\): \(\quad\) feedback inverter delay; \(d_{X\rightarrow Y} = 6/3 + 1 = 3\)

\(d_{Y\rightarrow X}\): \(\quad \overline{\phi}\)-arm delay; \(d_{Y\rightarrow X} = d_{mux} = 6\)

\(d_{X\rightarrow Q}\): \(\quad\) output inverter delay; \(d_{X\rightarrow Q} = d_{inv} = C_L/3 + 1 = 5\)

With these element delays, we can express the propagation delay of the transparent D-latch as \(d_{D\rightarrow Q} = d_{D\rightarrow X} + d_{X \rightarrow Q} = 11.\) The waveform diagram shows the corresponding transitions. Initially, the D-latch is transparent because \(\phi = 1,\) and the \(\phi\)-arm is closed whereas the \(\overline{\phi}\)-arm is open. At time \(t=0,\) the \(D\)-input changes from 0 to 1. Output \(Q\) will follow \(D\) after the input has propagated through the \(\phi\)-arm to internal node \(X\) at \(t = 6,\) and then through the output inverter at time \(t = 11.\) The change of internal node \(X\) also affects node \(Y,\) which changes at time \(t = 9\) from 0 to 1 after propagation delay \(d_{X\rightarrow Y}\) of the feedback inverter.

When the D-latch becomes opaque at time \(t = 14,\) it stores value \(Q = 1.\) Clock signal \(\phi = 0\) closes, or turns on, the \(\overline{\phi}\)-arm and opens, or turns off, the \(\phi\)-arm. Thus, after the clock transition at \(t=14,\) signal \(Y=1\) takes propagation delay \(d_{Y\rightarrow X} = 6\) time units through the inverting \(\overline{\phi}\)-arm to reinforce internal node \(X=0\) at time \(t = 20.\) It is this switching delay of the multiplexer that can cause trouble. In particular, the transition of input \(D\) must occur by a sufficiently long time period before the negative transition of clock \(\phi\) to stabilize internal node \(Y\) through the feedback inverter, because the feedback loop is bistable only if \(X = \overline{Y}.\) If we force \(X = Y,\) the feedback loop will assume an unpredictable state. This can happen, if the time interval \(d_{D\rightarrow \phi}\) between the transitions of inputs \(D\) and \(\phi\) is too small. Figure 4.60 illustrates the timing problems of a D-latch.

Figure 4.60: D-latch timing problems: (left) in corner case \(d_{D\rightarrow \phi} = d_{D\rightarrow X} + d_{X\rightarrow Y}\) the D-latch stores the input after \(d_{D\rightarrow Q} = d_{D\rightarrow X} + d_{X\rightarrow Q},\) (middle) \(d_{D\rightarrow X} < d_{D\rightarrow \phi} < d_{D\rightarrow X} + d_{X\rightarrow Y}\) causes output \(Q\) to follow \(D\) after a glitch increases the propagation delay, (right) \(d_{D\rightarrow \phi} < d_{D\rightarrow X}\) misses the input transition of \(D\) because \(D\) changes too close to the negative edge of clock \(\phi.\)

In the waveform diagram on the left of Figure 4.60, the interval between the transition of input \(D\) and clock \(\phi\) is \(d_{D\rightarrow \phi} = d_{D\rightarrow X} + d_{X\rightarrow Y}.\) At time \(t = 9,\) the negative clock edge begins to close the \(\overline{\phi}\)-arm and to open the \(\phi\)-arm. Internal node \(Y\) follows \(D\) just in time at \(t = 9\) to reinforce internal node \(X\) after the multiplexer delay of \(d_{Y\rightarrow X} = 6\) time units, so that \(X = 0\) for \(t > 15.\) Time interval \(d_{D\rightarrow \phi} = d_{D\rightarrow X} + d_{X\rightarrow Y}\) is the smallest interval for the D-latch to capture input \(D\) safely.

The waveform diagram in the middle of Figure 4.60 assumes that interval \(d_{D\rightarrow \phi}\) is smaller, that is \(d_{D\rightarrow X} < d_{D\rightarrow \phi} < d_{D\rightarrow X} + d_{X\rightarrow Y}.\) After the negative clock edge at time \(t = 8,\) the feedback loop enters unstable state \(X = Y = 0.\) After the multiplexer delay of \(d_{Y\rightarrow X} = 6\) time units, at time \(t = 14,\) the inverting \(\overline{\phi}\)-arm drives value \(\overline{Y} = 1\) onto inner node \(X.\) However, since the feedback inverter produces output value \(Y = 1\) just 1 time unit after the negative clock edge at time \(t = 9,\) the \(\overline{\phi}\)-arm pulls inner node \(X\) back to 0 at time \(t = 15.\) The resulting glitch on node \(X\) propagates through the feedback and output inverters. Since inverters attenuate such glitches, the feedback loop stabilizes and recovers \(X = 0.\) However, the glitch appears at output \(Q\) after propagation delay \(d_{glitch} = d_{D\rightarrow \phi} + D_{Y\rightarrow X} + d_{X\rightarrow Q},\) which is larger than propagation delay \(d_{D\rightarrow X} + d_{X\rightarrow Q}\) of the transparent D-latch.

The waveform diagram on the right in Figure 4.60 assumes an even smaller interval \(d_{D\rightarrow \phi} < d_{D\rightarrow X}.\) The negative clock edge opens the \(\phi\)-arm too early to pull inner node \(X\) to 0. The \(\phi\)-arm fights the closing \(\overline{\phi}\)-arm, which succeeds to pull \(X\) to 1. As a consequence, the D-latch fails to capture input \(D=1\) and continues to store the old value \(Q = D = 0.\) The output inverter propagates the glitch of node \(X\) to output \(Q\) before restoring the old output value \(Q = 0.\)

Figure 4.61: D-latch timing: the smaller the interval between input transition of \(D\) and the negative clock edge, \(d_{D\rightarrow \phi},\) becomes, the larger the propagation delay from input \(D\) to output \(Q,\) \(d_{D\rightarrow Q},\) becomes. When the interval becomes too small, the D-latch fails to capture the input transition entirely, see case (4).

Figure 4.61 summarizes the timing behavior of the D-latch in a graph that plots interval \(d_{D\rightarrow \phi}\) on the horizontal axis and propagation delay \(d_{D\rightarrow Q}\) on the vertical axis. Case (1) corresponds to the scenario in Figure 4.59, where the transition of input \(D\) occurs sufficiently early before the negative clock edge for the D-latch to propagate the change safely to output \(Q.\) The propagation delay is the sum of the delays of the \(\phi\)-arm and the output inverter, \(d_{D\rightarrow Q} = d_{D\rightarrow X} + d_{X\rightarrow Q}.\) Senarios (2), (3), and (4) correspond to the three cases illustrated in Figure 4.60. The closer the transition of input \(D\) gets to the negative clock edge, the larger propagation delay \(d_{D\rightarrow Q}\) becomes. When interval \(d_{D\rightarrow \phi}\) becomes too small, the D-latch fails to capture the new input signal altogether, and retains the old input value. We observe that for a safe operation of the D-latch, we need to guarantee that input \(D\) changes sufficiently early before the negative clock edge, i.e. interval \(d_{D\rightarrow \phi}\) must be large enough.

The setup time \(t_{setup}\) is the minimum time input \(D\) must be stable before the negative clock edge to capture the input value within a reasonable delay \(d_{D\rightarrow Q}.\)

For example, reasonable specifications define \(t_{setup}\) to be \(5\,\%\) larger than the propagation delay of the transparent D-latch, \(t_{setup} = 1.05\,(d_{D\rightarrow X} + d_{X\rightarrow Q}).\)

Figure 4.62: D-latch setup time and hold time characterize the time interval around the negative clock edge where input \(D\) may change safely. The negative clock edge occurs at \(d_{D\rightarrow \phi} = 0.\)

The timing problems of the D-latch occur even when the transition of input \(D\) occurs after the negative clock edge. If input \(D\) transitions before the \(\phi\)-arm is completely open, glitches can propagate to output \(Q.\) Figure 4.62 illustrates delay \(d_{D\rightarrow Q}\) as a function of interval \(d_{D\rightarrow \phi}.\) Analogous to the setup time, for a glitch-free operation of the D-latch, we must ensure that input \(D\) does not transition until a time period after the negative clock edge has passed.

The hold time \(t_{hold}\) is the minimum time input \(D\) must be stable after the negative clock edge to capture the input value within a reasonable delay \(d_{D\rightarrow Q}.\)

Setup time and hold time characterize the timing behavior of the D-latch around the negative clock edge. Most manufacturers offer D-latches as basic circuit elements, and supply their process specific setup and hold times in their datasheets. For the circuit designer, it is important to ensure that the data input does not change within the interval around the negative clock edge of the D-latch. This is the reason why the clock input of the D-latch is commonly connected to the regular beat of a clock signal with a well defined clock period. Such a clock signal restricts the design choices but gives the designer a clean time reference for the permitted delay range of the circuit that drives the data input of a D-latch.

4.7.3. D-Flipflop¶

A D-flipflop is a bistable memory element with data input \(D,\) clock input \(\phi,\) and output \(Q.\) A D-flipflop is an edge-triggered memory element that is activated by a clock edge. At the triggering clock edge, the D-flipflop stores input \(D\) until the next triggering clock edge occurs. This behavior is different from a D-latch which is level-sensitive. Whereas a D-latch is opaque when the clock input level is low, the D-flipflop is always opaque except for the short time period around the triggering clock edge. Nevertheless, the D-flipflop can be build by a series composition of two D-latches with complemented clock inputs, as shown in Figure 4.63. The D-flipflop symbol on the right has a triangle at the clock input to indicate that the flipflop is edge triggered.

Figure 4.63: A positive-edge triggered D-flipflop implemented with two back-to-back D-latches and complemented clocks.

The first D-latch with input \(D\) and output \(Q1\) is called the master latch, and is negative-sensitive, because it is transparent when clock input \(\phi = 0.\) The second D-latch with input \(Q1\) and output \(Q\) is the slave latch, and is positive-sensitive, because it is transparent when clock input \(\phi = 1.\) The D-flipflop is positive-edge triggered because it stores input \(D\) at the rising edge of the clock. If we invert the clock input, the flipflop would be negative-edge triggered, and store input \(D\) at the falling edge of the clock. Figure 4.64 shows a CMOS circuit for the positive-edge triggered D-flipflop. The circuit saves four transistors by removing the output inverter of the master latch and the input inverter of the slave latch.

Figure 4.64: A 20-transistor CMOS circuit for the positive-edge triggered D-flipflop saves the output inverter of the master latch and the input inverter of the slave latch.

To understand the functionality of the D-flipflop, consider the switch model and the associated waveform diagrams in Figure 4.65 and Figure 4.66. Figure 4.65 illustrates the operation during the negative half-cycle of the clock. The master latch is transparent, and input \(D\) can propagate to inner node \(Q1.\) Since the slave latch is opaque, input \(D\) cannot propagate beyond \(Q1\) to output \(Q.\) Instead, the slave latch reinforces output value \(Q.\)

Figure 4.65: Switch model of D-flipflop and operation during negative half-cycle of the clock. The master latch is transparent and the slave latch is opaque.

During the positive half-cycle of the clock, illustrated in Figure 4.66, the master latch is opaque, and disconnects input \(D\) from inner node \(Q1.\) Thus, the master latch stores the last value of input \(D\) before the positive clock edge. The slave latch is transparent, and propagates the value of \(Q1\) to output \(Q,\) which is the last value of input \(D\) before the positive clock edge.

Figure 4.66: Switch model of D-flipflop and operation during positive half-cycle of the clock. The master latch is opaque and the slave latch is transparent.

Note that output \(Q\) remains unchanged after the negative clock edge, because the opaque slave latch stores during the negative half-cycle the value that the master latch stored during the preceding positive half-cycle of the clock.

The triggering clock edge is the timing critical transition of the D-flipflop. In case of the positive-edge triggered D-flipflop, the rising clock edge turns into the falling clock edge of the master latch. This is the critical clock edge, where the setup time and hold time of the master D-latch must be observed. Thus, the D-flipflop is subject to the same timing constraints as a D-latch. Input signal \(D\) must be stable during the setup time before the positive clock edge and during the hold time after the positive clock edge.

The D-flipflop is also called register, and is one of the most widely used memory elements in digital circuit design. A single D-flipflop implements a 1-bit register. An \(n\)-bit register consists of \(n\) D-flipflops triggered by the same clock signal. Each of the \(n\) D-flipflops has an independent D-input and Q-output. The register symbol is shown on the right.

4.2

A D-latch is transparent when clock \(\phi = 1\) and opaque when \(\phi = 0.\) A positive-edge triggered D-flipflop stores input \(D\) at the rising edge of clock signal \(\phi.\) Complete the waveform diagram below with D-latch output \(Q_\text{latch}\) and D-flipflop output \(Q_\text{ff},\) assuming both memory elements are connected to clock \(\phi\) and input signal \(D.\)

D-latch output \(Q_\text{latch}\) follows input \(D\) while clock \(\phi = 1.\) At the negative clock edge, when the clock transitions from 1 to 0, the D-latch retains the value input \(D\) has at the transition.

D-flipflop output \(Q_\text{ff}\) changes at the positive clock edges only, i.e. where the clock signal transitions from 0 to 1. There, it stores the value of input \(D,\) and retains this value for the entire clock cycle until the next positive clock edge occurs.

4.8. Wires¶

In circuit diagrams we draw wires to signify the connectivity between terminals of transistors, gates, and larger circuits. Real wires, however, exhibit a more complex timing behavior than meets the eye. We distinguish three wire models:

Ideal wire: The wire has negligible resistance and capacitance that we can ignore for the purpose of delay analysis. In practice, this simple model is relevant for very short wires only.

Capacitive wire: The wire has negligible resistance but significant capacitance. This model applies to medium length wires. For delay analysis, we can model the wire capacitance as a lumped capacitance in parallel to the load capacitance of the gate driving the wire. The method of logical effort captures stray capacitances due to wires without changes.

RC wire: The wire has significant resistance and capacitance. Long wires have significant delay which can be modeled as resistances and capacitances distributed along the extend of the wire. In state-of-the-art VLSI chips long wires have delays spanning tens of clock cycles, and have even become performance limiting. Modeling long wires is beyond the capabilities of the method of logical effort.

In this section, we introduce the Elmore delay as an alternative model for the timing analysis of RC circuits, and show that we can reduce the signal propagation delay of long wires from a quadratic function in the wire length to a linear function by using repeaters.

4.8.1. Elmore Delay¶

When a wire is so long that it incurs a delay comparable or even larger than the delay of its driver, we need to account for the wire delay if we wish to minimize the delay of the whole circuit. Figure 4.67 illustrates the situation with a long wire driven by an inverter and load capacitance \(C_L.\) We can model the resistance and capacitance of the wire by considering an infinitesimally short segment. Then, the methods of calculus apply, and the model for the wire delay results in the famous diffusion equation, which does not have analytic solutions for the problem at hand. We could solve the diffusion equation numerically, but a more insightful model can be derived from an approximation of the wire by means of finite resistances and capacitances, lumped into discrete resistors and capacitors, as shown in Figure 4.67.

Figure 4.67: RC model of a long wire driven by an inverter and load capacitance \(C_L.\)

The Elmore delay approximates the delay of an RC wire with \(n\) segments very accurately as the sum of time constants:

\[t_{RC} = \sum_{i=1}^n \Biggl( \sum_{j=1}^i R_j \Biggr) C_i\,.\]

Figure 4.68 illustrates the indexing used in this sum for \(1 \le i \le 4.\) Each index corresponds to a branching node with an off-path capacitance. If we index the resistors and capacitors as shown in Figure 4.68, we can interpret the Elmore delay as the sum of the \(RC\) constants induced by the currents shown in the RC model. The total drive current flowing from the inverter into the wire is \(i_1 + i_2 + i_3 + i_4.\) By KCL, \(i_1\) charges capacitor \(C_1,\) whereas current \(i_2 + i_3 + i_4\) flows into the remaining wire segments. The RC constant of the first segment is \(R_1 C_1.\) Current \(i_2\) charges capacitor \(C_2.\) Since \(i_2\) flows through \(R_1\) and \(R_2,\) the time constant for the second segment is \((R_1 + R_2) C_2.\) For the whole wire model in Figure 4.68, the Elmore delay amounts to:

\[t_{RC}\ =\ R_1 C_1 + (R_1 + R_2)\, C_2 + (R_1 + R_2 + R_3)\, C_3 + (R_1 + R_2 + R_3 + R_4)\, C_4 \ =\ \sum_{i=1}^4 \Biggl(\sum_{j=1}^i R_j\Biggr) C_i\,.\]

Figure 4.68: Wire RC model with currents for Elmore delay approximation.

The Elmore delay enables us to derive the delay of a wire as a function of its length. Assume that a wire segment of length \(\Delta L\) has lumped resistance \(R\) and capacitance \(C.\) In Figure 4.68, we assume that \(R_i = R\) and \(C_i = C\) for all nodes \(i.\) Then, the Elmore delay of a wire with \(n\) segments and length \(L = n \Delta L\) is:

\[\begin{eqnarray*} t_{RC}(n) &=& \sum_{i=1}^n \Biggl(\sum_{j=1}^i R\Biggr) C \\ &=& \sum_{i=1}^n (i R) C \\ &=& \frac{1}{2} n (n + 1) RC\,. \end{eqnarray*}\]

We conclude that the Elmore delay is proportional to the square of the number of wire segments \(n\) and is, hence, proportional to the square of the wire length, \(t_{RC}(L) \propto L^2.\) Intuitively, this result is obvious considering (1) that the total resistance of a wire \(R_{wire} = n R\) grows linearly in \(L,\) and (2) the total capacitance \(C_{wire} = n C\) grows linearly in \(L\) viewing the wire as one plate of a capacitor and the rest of the chip as the other. Therefore, product \(RC\) grows quadratically in \(L.\)

4.8.2. Repeater Insertion¶

If we want to reduce the delay of a long wire, increasing the size of the driver alone is not an effective method. However, if we use repeaters such as inverters or buffers at carefully chosen locations distributed across the wire, we can reduce the wire delay significantly. More succinctly, we divide a wire of length \(L\) into \(n\) segments of length \(\Delta L = L/n,\) and insert a repeater at the beginning of each segment, as illustrated in Figure 4.69. Using inverters as repeaters minimizes the additional delay introduced by the repeaters themselves. Nevertheless, to reduce the overall delay, we need to strike the proper balance in the number of repeaters. Too few repeaters, and the wire delay remains quadratic in wire length. Too many repeaters, and the wire delay is dominated by the delay of the inverter chain.

Figure 4.69: Wire model with inverters as repeaters.

To find the optimal number of repeaters, consider RC model of a single wire segment in Figure 4.70. We assume that the wire of length \(L\) with total resistance \(R_{wire}\) and capacitance \(C_{wire}\) has \(n\) segments of length \(\Delta L = L/n,\) such that each wire segment has resistance \(R_{wire}/n\) and capacitance \(C_{wire}/n.\) The RC model of the repeating inverter shall have on-resistance \(R_{inv}\) and output capacitance \(C_{inv}.\) Assuming that all inverters have the same size, the load capacitance of the wire segment is the input capacitance of the inverter, \(C_{inv}.\)

Figure 4.70: RC model of one repeated wire segment.

The Elmore delay of one wire segment of length \(\Delta L\) including repeating inverter is

\[\begin{eqnarray*} t_{RC}(\Delta L) &=& R_{inv} C_{inv} + \Bigl(R_{inv} + \frac{R_{wire}}{n}\Bigr)\,\Bigl(C_{inv} + \frac{C_{wire}}{n}\Bigr) \\ &=& 2 R_{inv} C_{inv} + \frac{1}{n} (R_{inv} C_{wire} + R_{wire} C_{inv}) + \frac{1}{n^2} R_{wire} C_{wire}\,. \end{eqnarray*}\]

Therefore, the Elmore delay of the whole repeated wire of length \(L\) is

\[t_{RC}(L)\ =\ n\,t_{RC}(\Delta L)\ =\ 2 n R_{inv} C_{inv} + R_{inv} C_{wire} + R_{wire} C_{inv} + \frac{1}{n} R_{wire} C_{wire}\,.\]

Since the optimal number of repeaters \(n\) minimizes the wire delay \(t_{RC}(L),\) we set the derivative to zero:

\[\frac{\partial t_{RC}(L)}{\partial n}\ =\ 2 R_{inv} C_{inv} - \frac{1}{n^2} R_{wire} C_{wire}\ \stackrel{!}{=}\ 0\,,\]

and find the optimal number of repeaters

\[n = \sqrt{\frac{R_{wire} C_{wire}}{2 R_{inv} C_{inv}}}\,.\]

With this value of \(n,\) and utilizing the observation that the total resistance and capacitance of the wire are proportional to its length \(L,\) i.e. \(R_{wire} \propto L\) and \(C_{wire} \propto L,\) we find that the minimum delay of the repeated wire is proportional to \(L\):

\[\begin{eqnarray*} \min t_{RC}(L) &=& \sqrt{8 R_{inv} C_{inv} R_{wire} C_{wire}} + R_{inv} C_{wire} + R_{wire} C_{inv} \\ &\propto& \sqrt{L \cdot L} + L + L \\ &\propto& L\,. \end{eqnarray*}\]

Here, we assume that the inverter size and, hence, \(R_{inv}\) and \(C_{inv}\) are constants, independent of wire length \(L.\) We conclude, that the delay of long wires does not need to be proportional to the square of their length but, at the expense of inserting repeaters, can be reduced to be directly proportional to their length.

Footnotes

[1]	The number of inputs of a gate is also called fan-in. For instance, a NAND gate with \(n\) inputs has a fan-in of \(n.\)

Introduction to Digital Circuits