The concept of improving the timing behavior of a circuit by relocating flip-flops is called retiming and was first presented by Leiserson and Saxe. The ASTRA algorithm proposed an alternative view of retiming using the equivalence between retiming and clock skew optimization. This work defines the relationship between the Leiserson-Saxe and the ASTRA approaches and utilizes it to solve the problem of retiming for minimum area. The new algorithm, Minaret, uses the linear programming formulation of the Leiserson-Saxe approach. The underlying philosophy of the ASTRA approach is incorporated to reduce the number of variables and constraints in the linear program. This reduction in the size of the linear program makes Minaret space and time efficient, enabling minimum area retiming of circuits with over 56,000 gates in under 15 minutes.
Controller circuits synthesized from high-level languages often have many more latches than the minimum, with a resulting sparse reachable state space that has a particular structure. We propose an algorithmthat exploits this structure to remove latches. The reachable state set (RSS) is much easier to compute for the new, smaller circuit and can be used to efficiently compute the RSS of the original. Thus we provide a method for obtaining the RSS, and two different initial implementations from which to begin logic optimization.
The objective of this paper is to provide an effective technique for accurate modeling of the external input sequences that affect the behavior of Finite State Machines (FSMs). The proposed approach relies on adaptive modeling of binary input streams as Markov sources of fixed-order. The input model itself is derived through a one-pass traversal of the input sequence and can be used to generate an equivalent sequence, much shorter in length compared to the original sequence. The compacted sequence can be subsequently used with any available simulator to derive the steady-state and transition probabilities, and the total power consumption in the target circuit. As the results demonstrate, large compaction ratios of orders of magnitude can be obtained without a significant loss (less than 3% on average) in the accuracy of estimated values.
This paper presents a novel technique for synthesis of speed-in-dependent circuits. It is based on partial order representation of the state graph called STG-unfolding segment. The new method uses approximation technique to speed up the synthesis process. The method is illustrated on the basic implementation architecture. Experimental results demonstrating its efficiency are presented and discussed.
This paper presents a technique, alternative to performance-driven synthesis, that allows to drastically increase the average throughput of combinational logic blocks by transforming fixed-latency units into variable-latency ones that run with a faster clock cycle. The transformation is fully automatic and can be used in conjunction with traditional design techniques, such as pipelining, to improve the overall performance of speed-critical systems. Results, obtained on a large set of benchmark circuits, are very promising.
CAD tools and research in the area of reduced-order modeling of large linear interconnect networks have evolved from merely finding a Padé approximation for the given network transfer function to finding an approximate transfer function that preserves such circuit-theoretic properties of the network as stability, passivity, and RLC synthesizability. In particular, preserving passivity guarantees that the reduced-order models will be well-behaved when embedded back in the circuit where the interconnect network originated. While stability can be ascertained by studying the poles of the reduced-order transfer function, passivity depends on both the poles and zeros of the network driving-point impedance. In this paper, we present a novel method for studying the zeros of reduced-order transfer functions and show how it yields conclusions about passivity and synthesizability. Moreover, in order to obtain a guaranteed-passive reduced-order model for multiport RC networks, a new algorithm based on the Arnoldi iteration is presented. This algorithm is as computationallyefficient as the one used to generate guaranteed-stable reduced-order models [1].
None of the existing network reduction tools preserve passivity for RLC networks. The loss of passivity can be a serious problem because simulations of the reduced networks may encounter time step too small errors. This paper presents a set of transformations called Split Congruence Transformations (SCTs) which can be used to accurately reduce a RLC network while preserving passivity.
This paper presents a new Gaussian quadrature method for interconnect modeling which applies to one-dimensional distributions of circuit element values, the line structures often used to model interconnect wires. The method takes a line circuit, which may be an arbitrary combination of lumped and distributed elements, and produces a small lumped model whose transfer and input characteristics approximately match those of the original circuit. The same algorithm may be used to generate lumped models for distributed elements, or to reduce long chains of lumped elements. The method also applies to one-dimensional distributions of coupling circuit elements, and, in general, to any one-dimensional distribution of circuit element quantities. Several examples demonstrate that the quadrature-based method is capable of automatically generating compact RC, LC and RLC line models with arbitrary accuracy.
In this paper we develop a gate level model that allows us to determine the best and worst case delay when there is dominant interconnect coupling. Assuming that the gate input windows of transition are known, the model can predict the worst and best case noise, as well as the worst and best case impact on delay. This is done in terms of a Ceff based gate model under general RC interconnect loading conditions.
Task scheduling for reactive real time systems is a difficult problem due to tight constraints that the schedule must satisfy. A static priority scheme is proposed here that can be formally validated. The method is applicable both for preemptive and non-preemptive schedules and is conservative in the sense that a valid schedule may be declared invalid, but no invalid schedule may be declared valid. Experimental results show that the run time of our validation method is negligible with respect to other steps in system design process, and compares favorably with other methods of schedule validation.
This paper introduces a basic mixed integer-linear model (MILP) to design application-specific heterogeneous multiprocessors (ASHM) allowing imprecise computation of tasks executed in a non-preemptive mode. The proposed model was used in the development of a genetic algorithm integrated into the tool set MEGA that uses soft computing techniques for design of optimal/near-optimal ASHMs subject to constraints on performance, cost and output-data quality.
The paper presents an algorithm to determine the close-to-smallest possible data buffer sizes for arbitrary synchronous data flow (SDF) applications, such that we can guarantee the existence of a deadlock free schedule. The presented algorithm fits in the design flow of GRAPE, an environment for the emulation and implementation of digital signal processing (DSP) systems on arbitrary target architectures, consisting of programmable DSP processors and FPGAs. Reducing the size of data buffers is of high importance when the application will be mapped on Field Programmable Gate Arrays (FPGA), since register resources are rather scarce.
Reactivity is one of the key features of hardware description languages. We present an efficient implementation of reactivity in the Scenic framework that allows the system designer to model hardware blocks. Scenic allows the designer to use C++ to model mixed hardware/software systems with a C++ compiler and a small library and without the need of a complex event-driven run-time kernel often found embedded in hardware description languages (HDL) such as VHDL and Verilog. Moreover, Scenic hardware descriptions can be easily mapped to HDL and synthesized into hardware implementations using commercially available tools. In this paper we present Scenics implementation of concurrency (signals and processes) and reactivity (waiting and watching). When C++ is used as an HDL, context-switching overhead can become a significant performance issue during simulation. We introduce the notion of delayed expression objects, or lambdas, to reduce context-switching. Examples and experimental results are presented to show the utility and simulation efficiency using the Scenic framework.
Designing for low power has become increasingly important in a wide variety of applications, including wireless telephony, mobile computing, high performance computing, and high speed networking. Despite reductions in power supply voltages, power consumption continues to rise and demands increased support from EDA tools and methodologies. Various tools have emerged to address different levels of the power problem, yet conventional methodologies often focus on the low leverage aspects. This paper will survey existing commercial tools used in low power design and present them in the context of an architecture focused low power design methodology.
As the complexity of high-performance microprocessor increases, functional verification becomes more and more difficult and RTL simulation emerges as the bottleneck of the design cycle. In this paper, we suggest C language-based design and verification methodology to enhance the simulation speed instead of the conventional HDL-based methodologies. RTL C model( StreC) describes the cycle-based behaviors of synchronous circuits and is followed by model refining and optimization using LifeTime Analyzer( LTA ) and Cleaner . The simulation speed of cycle-based C model makes it possible to test the RTL design with the 'real-world' application programs in the order-of-magnitude faster speed than the commercial event-driven simulators. Using the proposed functional verification methodology, HK486, an intel 80486-compatible microprocessor was successfully designed and verified.
In this paper an approach is presented for the hierarchical verification of the memory control units, I/O adapters and processor interconnect units as found in multiprocessor computer systems. It is shown how such units could be verified better and faster by the introduction of random executable timing diagrams and associated CAD tool support. Furthermore, it is shown how the timing diagrams for the unit network verification are easily derived from the timing diagrams specified for the units. The multiprocessor hardware test showed the effectiveness of the proposed verification approach.
This paper describes the use of a high-level view (functional view) of a clock
regenerator circuit for generating effective and inexpensive manufacturing
tests. It is shown that the tests gen-erated from the traditional, structural
view add hardware overhead, increase design time and potentially lower
effective yield when compared to the tests generated from the functional view.
A test generation procedure is described and successfully used on a
microprocessor design.
Index terms : Microprocessor Testing, Test Pattern Generation, Fault
Simulation, Clocks, Clock Regenerators.
In recent years, logic emulation has been widely used as a key design verification methodology in many complex CPU, telecom, and multimedia design projects. When using logic emulation for design verification, designers often need toperform engineering changes as a result of design debugging of a design specification modification. One of the essential issues to engineering changes is the turn-around time. Ideally, after designers modify their designs, they resume their debugging and verification tasks immediately. However, converting a design from its Register-Transfer-Level (RTL) description to a target emulator is a time-consuming procedure which may take hours. Such long engineering-change turn-around times are unacceptable by the designers. In this paper, we present a real-time RTL engineering-change method supporting on-line debugging for logic-emulation applications. We propose a novel design method which is able to link design data generated at different design stages in a unified way. Using this method, the users can immediately locate the portion of the circuit design affected by the design modification from its RTL specification. This feature provides users with a fast time-to-debug environment by significantly improving the efficiency of the engineering-change process. We have developed aprototype system Quick ECO supporting interactive on-line RTL engineering changes. Experimental results on a number of industrial designs are reported to demonstrate the effectiveness of the proposed method.
In this paper, we introduce a Shared Multiple Rooted XOR-based Decomposition Diagram (XORDD) to represent functions with multiple outputs. Based on the XORDD representation, we develop a synthesis algorithm for general Exclusive Sum-of-Product forms (ESOP). By iteratively applying transformations and reductions, we obtain a compact XORDD which gives a minimized ESOP. Our method can synthesize larger circuits than previously possible. The compact ESOP representation provides a form that is easier to synthesize for XOR heavy multi-level circuit, such as arithmetic functions. We have applied our synthesis techniques to a large set of benchmark circuits in both PLA and combinational formats. Results of the minimized ESOP forms obtained from our synthesis algorithm are also compared to the SOP forms generated by ESPRESSO. Among the 74 circuits we have experimented with, the minimized ESOP's have fewer product terms than those of SOP's in 39 circuits.
We define a notion of equivalence for designs containing black boxes i.e., components whose functionality is not known; these arise naturally in the course of hierarchical design. Using this notion, we describe a sound and complete methodology for optimizing such designs.
Unate and binate covering problems are a special class of general integer
linear programming problems with which several problems in logic synthesis,
such as two-level logic minimization and technology mapping, are formulated.
Previous branch-and-bound methods for exactly solving these problems use
lower-bounding techniques based on finding maximal independent sets. In this
paper we examine lower-bounding techniques based on linear programming
relaxation (LPR) for the binate covering problem. We show that a combination of
traditional reductions (essentiality and dominance) and incremental computation
of LPR-based lower bounds can exactly solve difficult covering problems orders
of magnitude faster than traditional methods.
Keywords- overing problems, integer linear programming.
Graph coloring has several important applications in VLSI CAD. Since graph coloring is NP-complete, heuristics are used to approximate the optimum solution. But heuristic solutions are typically 10% off, and as much as 100% off, the minimum coloring. This paper shows that since real-life graphs appear to be 1-perfect, one can indeed solve them exactly for a small overhead.
A hierarchical two-dimensional field solution technique is introduced for capacitance extraction for VLSI interconnect modeling. As a basis for compromise between the efficiency of Boolean rules-based extraction and the accuracy of flat field solution, this hierarchical approach can handle realistic conductor cross-sections and multiple conformal and/or planarized dielectrics.
In this paper we prove that simply discarding conductors beyond a certain spacing during BEM capacitance extraction will result in a lower bound on the self-capacitance calculations and an upper bound on the mutual capacitance calculations that lie within that spacing. We prove that a potential-shift and truncate scheme can yield bounds opposite to those for the truncate only case; namely, an upper bound on the self capacitance and a lower bound on the mutual capacitance that lies within the chosen spacing. The ease with which the upper and lower bounds are calculated is shown, and their utility for selection of an optimal window size is described. A metal shell is also presented here that results in bounds similar to those of shift-truncate. We further propose a new potential-shift function that yields increased approximation accuracy compared to shift-truncate in many cases.
Extracting the inductance of complex interconnect topologies is a formidable task, and simulating the resulting dense partial inductance matrix is even more difficult. Furthermore, it is well known that simply discarding smallest terms to sparsify the inductance matrix can render the partial inductance matrix indefinite and result in an unstable circuit model. In this paper, we describe a methodology for incrementally generating a sparse partial inductance matrix based on using moments about s=0 to determine when a sufficient number of mutual inductances have been captured. The minimally required mutual inductances are extracted for a provably stable model.
The Method of Moments (MoM) is often effectively used in the extraction of passive components in modeling integrated circuits and MCM packaging. MoM extraction, however, involves solving a dense system of linear equations, and using direct factorization methods can be prohibitive for large problems. In this paper, we present a Fast Method of Moments Solver (FMMS) for the rapid solution of such linear systems. Our algorithm exploits the fact that the integral equation kernels are locally 'smooth' and can be dramatically compressed via the singular value decomposition (SVD). This greatly speeds up the matrix-vector products in a Krylov-subspace iterative algorithm (e.g., GMRES). We demonstrate the efficiency and exibility of our scheme for the modeling of embedded inductors in MCM-D. Results are presented to show that the method isaccurate and can be two orders of magnitude faster than Gaussian elimination and one order of magnitude faster than standard iterative schemes.
This paper examines the problem of statically analyzing the performance of embedded software. This problem is motivated by the increasing growth of embedded systems and a lack of appropriate analysis tools. We study different performance metrics that need to be considered in this context and examine a range of techniques that have been proposed for analysis. Very broadly these can be classified into path analysis and system utilization analysis techniques. It is observed that these are interdependent, and thus need to be considered together in any analysis framework.
This paper introduces the first high-level (task-level) model of hierarchical memories and describes a scheduling and allocation algorithm for system-level synthesis of heterogeneous multiprocessors. Caches are essential for modern RISC embedded cores to obtain sustained high performance. However, caches have received limited use in priority-driven preemptive real-time systems due to the unpredictability of caches average-case improvements are of no use in systems with hard deadlines. Program-level cache models do not take into account preemptions between multiple tasks running at multiple rates on embedded cores. Our task-level model of performance in the presence of memory hierarchies provides an efficient means to bound the guaranteed memory performance of tasks running in a multi-rate, multi-tasking environment. Our system synthesis algorithm uses software-based cache partitioning and reservation techniques to guarantee cache hits for some tasks and therefore improve task schedulability. Experimental results show that our model significantly improves schedulability of real-time tasks and can be evaluated efficiently during system-level synthesis.
The degree of flexibility of a real-time system architecture indicates the capability of the system to tolerate perturbations in timing related specifications. Flexibility is also an important factor in the trade-off studies between cost and performance. In this paper, we identify the need for a flexibility metric and show that the existing real-time analysis results cannot be directly used as such a metric. We formulate new metrics and illustrate their effectiveness in comparing the flexibility of different system architectures.
Many modern systems are designed as a set of interconnected reactive subsystems. The subsystem verification task is to verify an implementation of the subsystem against the simple deterministic high-level specification of the entire system. Our verification methodology, based on Symbolic Trajectory Evaluation, is able to bridge the wide gap between the abstract specification and the implementation specific details of the subsystem. This paper presents a detailed description of an industrial application of this methodology to the fixed point execution unit of the PowerPC processor. We were able to verify a representative instruction under all possible stall, bypass, pipeline conditions and under all possible timings for interface to other functional units in the processor.
In this paper we report on new techniques for verifying content addressable memories (CAMs), and demonstrate that these techniques work well for large industrial designs. It was shown in [6], that the formal verification technique of symbolic trajectory evaluation (STE) could be used successfully on memory arrays. We have extended that work to verify what are perhaps the most combinatorially difficult class of memory arrays, CAMs. We use new Boolean encodings to verify CAMs, and show that these techniques scale well, in that space requirements increase linearly, or sublinearly, with the various CAM size parameters. In this paper, we describe the verification of two CAMs from a recent PowerPC TM microprocessor design, a Block Address Translation unit (BAT), and a Branch Target Address Cache unit (BTAC). The BAT is a complex CAM, with variable length bit masks. The BTAC is a 64entry, 64bits per entry, fully associative CAM and is part of the speculative instruction fetch mechanism of the microprocessor. We believe that ours is the first work on formally verifying CAMs, and we believe our techniques make it feasible to efficiently verify the variety of CAMs found on modern processors.
We present our experiences with the formal verification of an automotive chip used to control the safety features in a car. We used a BDD based model checker in our work. We describe our verification methodology for verifying a very complicated property on a relatively large design. We also describe the bugs that were found and present our views on how to make model checking an effective integrated part of the design flow for complex hardware systems.
A new system design methodology is proposed that separates com-munication from behavior. To demonstrate the methodology we applied it to a simple ATM design. Since verification is clearly a major stumbling block for large system design, we focussed on the verification aspects of our methodology. In particular, a simulator was developed that is based on the communication paradigm typical of our methodology. The simulator gives substantial performance improvements without sacrificing user access to detail. Finally, the potential for this methodology to improve verification, modeling and synthesis is explored.
This paper presents an integrated design environment that supports the design and analysis of digital systems from initial concept to the final implementation. The environment supports both system level performance and dependability analysis from a common modeling representation. A tool called ADEPT (Advanced Design Environment Prototype Tool) has been developed to implement the environment. ADEPT is based on IEEE 1076 VHDL and uses commercial schematic capture systems as a front end via an EDIF interface. Several examples are presented which demonstrate various aspects of the environment.
The rapid increase in the complexity of systems demands new approaches to design exploration and trade-off analysis. This paper presents an exploration environment that provides a uniform way to perform exploration at the conceptual (pre-specification) stages of design. The environment encapsulates knowledge about how to perform estimations in different application domains, and separates a designer from the mechanics of the estimation process.
Binary Decision Diagrams (BDDs) are efficient at manipulating large sets in a compact manner. BDDs, however, are inefficient at utilizing the memory hierarchy of the computer. Recent work addresses this problem by manipulating the BDDs in breath-first manner (BFS). BFS processing is quite successful at reducing the number of page faults when the BDDs do not fit in the available physical memory. When paging does not take place, it is much less clear which paradigm leads to the better performance. In this paper, we perform a detailed analysis of BFS and DFS packages using simulation and direct performance monitoring of the memory hierarchy. We show that there is very little difference in TLB and cache miss rates for DFS and BFS paradigms. We also show that differences in execution time between carefully tuned BFS and DFS implementations are primarily a function of the lossless computed table used in BFS implementations, and not a function of memory locality. Furthermore, we present implementation changes to the the Cudd package that can improve execution times by asmuch as 26% when the problem fits in main memory, and a factor of six when paging is involved.
We propose a new algorithm, called linear sifting, for the optimization of decision diagrams that combines the efficiency of sifting and the power of linear transformations. We show that the new algorithm is applicable to large examples, and that inmany cases it leads to substantially more compact diagrams when compared to simple variable reordering. We show inwhat sense linear transformations complement variable reordering, and we discuss applications of the new technique to synthesis and verification.
In many computer-aided design tools, binary decision diagrams (BDDs) are used to represent Boolean functions. To increase the efficiency and capability of these tools, many algorithms have been developed to reduce the size of BDDs. This paper presents heuristic algorithms that minimize the size of BDDs representing incompletely specified functions by intelligently assigning don'tt cares to binary values. The traditional algorithm, restrict [8], is often effective in BDD minimization, but can increase the BDD size. We propose new algorithms based on restrict which are guaranteed never to increase the size of the BDD, thereby significantly reducing peak memory requirements. Experimental results show that our techniques typically yield significantly smaller BDDs than restrict.
This paper presents new results in the area of timing optimization for multi-source nets. The Augmented RC-Diameter (ARD) is proposed as a natural and practical performance metric and a linear time algorithm for computing the ARD of a multi-source net is presented. Building on the ARD, an algorithm for optimal repeater insertion is presented: for a given multi-source topology the algorithm efficiently identifies an optimal assignment of repeaters to prescribed insertion points under the 'min cost timing feasible' problem formulation. The algorithm has been implemented and preliminary experimental results are promising.
This paper addresses how to compute required times at intermediate nodes in a combinational network given required times at primary outputs. The simplest approach is to compute them based on topological delay analysis without any consideration of false paths. In this paper, however, we take into account false paths between the intermediate nodes and the primary outputs explicitly to characterize the timing constraints at the nodes more accurately. We show that this approach leads to a technique for computing a more refined and relaxed timing constraint than that obtained by topological analysis. We generalize the notion of required times from a single constant to a relation where a signal is required at different times depending on the values of the other signals.
We present a novel set of tools for performing symbolic timing verification of timing diagrams. The tools are multi-purpose with uses in verification, derivation of synthesis constraints, and design evaluation. Our methodology is based on using techniques for manipulating Presburger formulas. We demonstrate using several interesting examples that the method is efficient in practice and should be considered for inclusion in commercial tools.
System design, i.e. the design of board-level circuits and systems-on-a-chip, focuses on the integration of off-the-shelf and application-specific VLSI components. A key aspect of system design is to ensure the satisfaction of component interface timing requirements. This is necessary for the correct exchange of information among components on a system. We present a methodology for the interface timing verification and subsequent timing-driven floorplanning of systems. We present results on the application of this methodology to real-world circuits.
Entire systems embedded in a chip and consisting of a processor, memory, and system-specific peripheral hardware are now commonly contained in commodity electronic devices. Cost minimization of these systems is of paramount economic importance to manufactures of these devices. By employing a variable configuration processor in conjunction with a multi-precision compiler generator there are situations in which considerable system cost reduction can be obtained by synthesizing a CPU that is narrower than the largest variable in the application program.
Numerous fast algorithms for the Discrete Cosine Transform (DCT) have been proposed. Until recently, it has been difficult to compare different DCT algorithms and select one which is best suited for implementation under a given set of design goals. We propose an approach for design space exploration at the algorithm and behavioral levels using behavioral synthesis tools and demonstrate its effectiveness for designing DCT ASIC. In particular, we study and compare the following nine DCT algorithms: Lee's, Wang's, DIT, DFT, Arai's, DIF, Vetterli's, planar rotation, and direct algorithm. The main conclusions of this study are (i) the best choice among fast DCT algorithms depends on a particular set of design goals and constraints and (ii) almost always more than an order of magnitude improvement can be achieved using algorithm and behavioral design space exploration.
This tutorial addresses the following questions:
This paper presents a verification technique which is specifically targeted to formally comparing large combinational circuits with some structural similarities. The approach combines the application of BDDs with circuit graph hashing, automatic insertion of multiple cut frontiers, and a controlled elimination of false negative verification results caused by the cuts. Two ideas fundamentally distinguish the presented technique from previous approaches . First, originating from the cut frontiers, multiple BDDs are computed for the internal nets of the circuit, and second, the BDD propagation is prioritized by size and discontinued once a given limit is exceeded.
Widely-separated time scales appear in many electronic circuits, making traditional analysis difficult or impossible if the circuits are highly nonlinear. In this paper, an analytical formulation and numerical methods are presented for treating strongly nonlinear multi-rate circuits effectively. Multivariate functions in the time domain are used to capture widely separated rates efficiently, and a special partial differential equation (the MPDE) is known to relate the multicariate forms of a circuit signals. Time-domain and mixed frequency-time simulation algorithms are presented for solving the MPDE. The new methods can analyze circuits that are both large and strongly nonlinear. Compared to traditional techniques, speedups of more than two orders of magnitude, as well as improved accuracy, are obtained.
Fault-driven analog and mixed-signal testing calls for rapid fault simulation techniques. A problem that has not been addressed effectively by existing research is that circuit parameters have tolerance ranges. In this paper, we propose representing parameters under variations as intervals, and present an efficient algorithm based on interval analysis and Householder's formula to compute the worst-case response bounds of good and faulty linear(ized) circuits under parameter variations. Our approach takes CPU time comparable to one nominal circuit simulation, and always produces correct and conservative results. The algorithm has been implemented into SPICE3F5. Experimental results show an acceptable accuracy.
A tool for the switch-level fault simulation and test evaluation of switched-capacitor systems is presented. Time or frequency-domain fault simulations with SWITCAP and time-domain fault simulations with HSPICE can be performed. Adequate fault models are presented for both simulators. The tool has proven to be very useful in the early evaluation of test strategies, providing similar results to those obtained at the transistor-level.
Many application-specific architectures provide indirect addressing modes with auto-increment/decrement arithmetic. Since these architectures generally do not feature an indexed addressing mode, stack-allocated variables must be accessed by allocating address registers and performing address arithmetic. Subsuming address arithmetic into auto-increment/decrement arithmetic improves both the performance and size of the generated code. Our objective in this paper is to provide a method for comprehensively analyzing the performance benefits and hardware cost due to an auto-increment/ decrement feature that varies from -l to +l, and allowing access to k address registers in an address generator. We provide this method via a parameterized optimization algorithm that operates on a procedure-wise basis. Hence, the optimization techniques in a compiler can be used not only to generate efficient or compact code, but also to help the designer of a custom DSP architecture make decisions on address arithmetic features. We present two sets of experimental results based on selected benchmark programs: (1) the values of l and k beyond which there is little or no improvement in performance, and (2) the values of l and k which result in minimum code area.
The design process for fixed-point implementations either in software or in hardware requires a bit-true specification of the algorithm in order to analyze quantization effects on an algorithmical level, abstracting from implementational details. On the other hand, system design starts from a floating-point description, so that a transformation of a floating-point description into a fixed-point description becomes necessary. Within this paper we present a tool that allows an automated, interactive transformation from floating-point ANSI-C into a bit-true specification based on a new data type fixed that is introduced as an extension to ANSI-C. The concept is rooted in a sophisticated data dependency analysis that allows to handle control structures as well as pointers. It is part of the fixed-point design environment FRIDGE 1 which includes an advanced simulator that covers the extended ANSI-C syntax as well as target specific compilers which allow to generate efficient fixed-point implementations either for HW or for SW, starting from the bit-true algorithm specification.
We present the Instruction Set Description Language, ISDL, a machine description language used to describe target architectures to a retargetable compiler. The features and flexibility of ISDL enable the description of vastly different architectures, in particular VLIW architectures. ISDL explicity supports constraints that define valid operation groupings within an instruction, increasing the range of specifiable architectures. We have written a tool that, given an ISDL description of a processor, automatically generates an assembler for it. Ongoing work includes the development of an automatic code-generator generator.
An experimental set of tools that generate instruction set simulators, assemblers, and disassemblers from a single description was developed to test if retargetable development tools would work for commercial DSP processors and microprocessors. The processor instruction set was described using a language called nML. The TMS320C50 DSP processor and the ARM7 microprocessor were modeled in nML. The resulting instruction set models execute about 25,000 instructions per second, and compiled instruction set simulation models execute about 150,000 instructions per second. The viability of this approach and the deficiencies of nML are discussed.
Exploitation of deep-submicron technology will depend critically on the availability of global system engineers able to bridge the gap between software-centric system thinking and hardware-software implementation of it in novel silicon architectures. This requires a rethinking of present engineering schools which are not well equipped to tackle global system engineering aspects. The concept of design institute is introduced where , based on visionary system design demonstrators, new methodologies, tools, libraries and courses are created and distributed over the global network. Design institutes provide a learning school for new design paradigms and form the ideal environment for the education of global system designers.
The InfoPad project was started at UC Berkeley in 1992 to investigate the issues involved in providing multimedia information access using a portable, wireless terminal. It quickly became clear that a key design constraint was the energy consumption, which could best be addressed through an integrated system approach. The project was therefore organized to address all design levels, including the applications and user interface, backbone network protocols, software for distributed network support, the wireless link, and the pad itself which used a number of low voltage ASIC designs and a processor running embedded code. Tools were developed when not available (particularly in support of low energy design), as well as an interface to mechanical designers who created a custom injection molded case. The wide scope of the project presented a number of unique challenges for a research environment and the lessons learn will be presented.
The Wearable Computer Project is a testbed integrating research on rapid design and prototyping. Based on representative examples from six generations of wearable computers, the paper focuses on the differences in rapid prototyping using custom design versus off-the-shelf components. The attributes characterizing these two design styles are defined and illustrated by experimental measurements. The off-the-shelf approach required ten times the overhead, 30% more cost, fifty times the storage resources, 20% more effort, five times more power, but 30% less effort to port software than the embedded approach.
Owing to rapid changes of IC technologies, traditional design rule checking is becoming inadequate to assure satisfactory levels of IC manufacturability. This paper describes a new computer supported design analysis environment that improves the efficiency of manufacturability assessment of new products. This environment, called MAPEX 2, is described in the paper along with some of its key procedures and algorithms. Illustrations of MAPEX 2 applications and performance figures are provided as well.
This paper describes a fully automatic standard-cell layout synthesis system, CELLERITY. The system is flexible in supporting a wide variety of process technologies and a range of library template styles. The tool is fully automatic and provides several options to the user to customize the layout template. The tool considers performance and yield and generates dense, design-rule correct layouts. Experimental results indicate that the area of CELLERITY-generated standard cells is competitive with manually designed cells in a majority of circuits. In block-level tests of industrial circuits, standard-cell blocks generated using CELLERITY cells are about equal to the block area produced by using a manually-designed library. Recently, an embedded microcontroller in a state-of-the-art sub-micron process technology was fabricated using CELLERITY-generated standard cells.
This paper describes the development of a concurrent methodology for standard cell library generation. Use of a novel physical design automation method enables a high degree of concurrency among process, circuit, and layout development. In addition to reducing overall time-to-market, the new method allows optimization to occur simultaneously across the circuit, layout, and process design spaces. The result is libraries with improved density, circuit performance, and process yield.
Cell characterization data is used by synthesis and timing verification tools to compile and validate a cell netlist which meets timing constraints imposed by the designer. Characterization tables contain data for multiple, simple equations representing a cells behavior and are an alternative to the single, monolithic characteristic equation. Data in the table is fit to a function whose form is fixed by the application, and the cells response is interpolated from the function. Tables can potentially increase accuracy, but large tables can cause a program to use dramatically more memory and run much slower. The optimization of characterization tables, in which accuracy is maintained but table size is significantly reduced, is important if large programs, such as synthesis, are to complete accurately and in a reasonable runtime. In this paper we address some of the issues involved in characterizing cells and optimizing characterization tables quickly and accurately. Experimental results from the use of these techniques within AMD for a Synopsys cell library is also presented.
In behavioral descriptions, statements that allow limited control jumps, such as Verilog disable statements on named blocks, are often used for describing system behavior in presence of exceptions. In this paper, we extend Timed Decision Tables (TDT), a tabular behavior model, to represent more general control structures including exceptions. We introduce the notion of action sharing that allows us to reduce resource requirements using existing high-level synthesis tools on descriptions with control exceptions. We present presynthesis algorithms that work on the extended TDT model and an algorithm that performs action sharing in TDT models. Our experiments on well-known HardwareC benchmarks show size reduction resulting from sharing actions in the input behavioral descriptions.
Successive, well organized application of transformations has been widely recognized as an exceptionally effective, but complex and difficult CAD task. We introduce a new potential-driven statistical approach for ordering transformations. Two new synthesis ideas are the backbone of the approach. The first idea is to quantify the characteristics of all transformations and the relationship between them based on their potential to reorganize a computation such that the complexity of the corresponding implementation is reduced. The second one is based on the observation that transformations may disable each other not only because they prevent the application of the other transformation, but also because both transformations target the same potential of the computation. These two observations drastically reduce the search space to find efficient and effective scripts for ordering transformations. A key algorithmic novelty is that both conceptual and optimization insights as well as all optimization algorithms are automatically derived by organized experimentation and statistical methods. On a large set of diverse real-life examples improvements in throughput, area, and power by large factors have been obtained. Both qualitative and quantitative statistical analysis indicate effectiveness, high robustness, and con-sistency of the new approach for ordering transformations.
Synthesis of Application Specific Programmable Processors poses numerous new tasks on behavioral synthesis tools. We address some of them including application bundling. Application Bundling is a synthesis task where n control-data flow graphs are bundled into at most m groups, so that each application belongs to at least one group and through-put constraints for all applications are satisfied. We have shown how avariety of application specific constraints such as manufacturing cost reduction and production risk reduction can be targeted during the synthesis process. The effectiveness of our approach is demonstrated on a number of real examples.
Often during the design process, it is necessary to analyze the effects of tradeoff among various performance attributes of the design. Visual representations in the form of plots, graphs, and tables are typically used. These visualizations can be generated using different applications such as MatLab R , Mathematica R , Excel R , and so forth. In order to use these tools, performance equations for the design must be rendered in an equational form with only a few variables, usually 2-3. However, performance models usually consist of a large number of attributes and evaluation procedures, typically written using the full power of a programming or hardware description language. We introduce a symbolic evaluation procedure to simplify performance models. Given partial data, such evaluation yields a residual performance model which is simpler than the original model. Symbolic evaluation can be effectively used to obtain residual performance equations in terms of the variables whose tradeoff visualization is of interest to the user.
A modeling approach is presented that captures the dependence of the power dissipation of a combinational logic circuit on its input/output signal switching activity. The resulting power macromodel, consisting of a single three dimensional table, can be used to estimate the power consumed in the circuit for any given input/output signal statistics. Given a low-level (typically gate-level) description of the circuit, we describe a characterization process by which such a table model can be automatically built. In contrast to other proposed techniques, this can be done for any given logic circuit without any user intervention, and applies to all possible input/output signal statistics; it does not require one to construct specialized analytical equations for the power dissipation. The three dimensions of our table-based model are the average input signal probability, average input transition density, and average output zero-delay transition density. This approach has been implemented and models have been built for many benchmark circuits. Over a wide range of input signal statistics, we show that this model gives very good accuracy, with an RMS error of under about 6%.
This paper proposes to use quantile points of the cumulative distri-bution function for power consumption to provide detailed information about the power distribution in a circuit. The paper also presents two techniques based on population pruning and stratification to improve the efficiency of estimation. Both population pruning and stratification are based on a lowcost predictor, such as zero-delay power estimate. Experimental results show the effectiveness of the proposed techniques in providing detailed power distribution information.
In this paper, we present a new statistical technique for estimating average power dissipation in sequential circuits. Due to the feedback mechanism, in sequential circuits power dissipation in consecutive clock cycles are temporally correlated, which violates the basic requirement of statistical mean inference procedures. We overcome this problem by using a randomness test and a sequential procedure to select a proper independence interval, which in turn is used to generate random power samples. A distribution-independent stopping criterion is applied to analyze the sample data and terminate the simulation upon achievement of the accuracy specification. The technique is successfully applied to a set of benchmark circuits.
We present two new algorithms for generating a small set of patterns for estimating the maximum instantaneous current through the power supply lines for CMOS circuits. The first algorithm is based on timed ATPG, while the second is a probability-based approach. Both algorithms can handle circuits with arbitrary but known delays and they produce a set of 2-vector tests. Experimental results demonstrating that the outcome of applying our algorithms is a small set of patterns producing a current that is a tight lower bound on the maximum instantaneous current are included.
Hardware/software co-simulation is generally performed with separate simulation models. This makes trade-off evaluation difficult, because the models must be re-compiled whenever some architectural choice is changed. We propose a technique to simulate hardware and software that is almost cycle-accurate, and uses the same model for both types of components. Only the timing information used for synchronization needs to be changed tomodify the processor choice, the implementation choice, or the scheduling policy. We show how this technique can be used todecide the implementation of a real-life example, a car dashboard controller.
Many co-simulation techniques either suffer from poor performance when simulating communications intensive systems, or they represent communications with a uniformly low level of detail. This paper presents a technique which allows communication to be represented at multiple levels of detail and which gives a designer the ability to dynamically choose the appropriate level for different parts of the system. This paper also presents a tool which uses this technique and experiments which show relative simulation speedups.
We demonstrate a new approach for minimizing the total of the static and the dynamic power dissipation components in a CMOS logic network required to operate at a specified clock frequency using joint optimization of both device and circuit designs for a specific logic schematic and activity profile. We present a new approach to designing ultra low-power CMOS logic circuits by joint optimization of supply voltage, threshold voltage and device widths for a specified speed constraint. The static (leakage) and dynamic (switching) energy components are considered and an efficient heuristic is developed that delivers over an order of magnitude savings in power over conventional optimization methods.
Multi-threshold CMOS is an increasingly popular circuit approach that enables high performance and low power operation. However, no methodologies have been developed to size the high Vt sleep transistor in an intelligent manner that trades off area and performance. In fact, many attempts at sizing the sleep transistor without close consideration of input vector patterns or internal structures can lead to large overestimates or large underestimates in sleep transistor sizing. This paper describes some of the issues involved in sizing transistors for MTCMOS and also introduces a variable breakpoint switch level simulator that can rapidly calculate delay in MTCMOS circuits as functions of design variables such as Vdd , Vt , and sleep transistor sizing.
We describe an architectural design space exploration methodology that minimizes the energy dissipation of digital circuits. The centerpiece of our methodology is a Verilog-based power estimation tool, Pythia, that blends the accuracy of low-level circuit simulators such as powermill with the speed of high level power estimators geared to design exploration. Pythia takes into account voltage-dependent capacitive nonlinearities and supports runtime adaptation of supply voltage. It employs a hybrid modeling aproach in which low-level simulation of logic gates and flip-flops can be combined with high level macromodels for memory structures where the energy per access is not as sensitive to input data statistics. The speed and accuracy of Pythia has enabled a detailed case study of two different approaches for the computation of the Inverse Discrete Cosine Transform, an integral component of the MPEG video coding algorithm. One approach uses conventional methods and the other exploits signal statistics to dynamically minimize the average number of operations required and the operating supply voltage.
This paper presents a power evaluation framework designed for estimating power consumption of a new video telephone compression standard, ITU-H.263, at the system level. A hierarchical, mixed-level simulation environment is built and cycle-accurate power macro-modeling is used for the architectural power evaluation. Experimental results show the effectiveness of the proposed framework and models.
Logic and structural transformations to reduce power have been considered by a number of authors. [2, 3, 4, 5]. However, all the results presented so far have been based on models of power and estimation methods that may not be a accurate reflection of `real world' constraints. The performance requirements of industrial designs often significantly restrict the applicability of the power reducing transformations. In this paper we present the results of investigations on the efficacy of a recently reported logic level power optimizing algorithm [5] on some commercial circuits, using customer supplied input waveforms. Based onaaccurate delay model, the power consumption of the circuits is estimated using the commercial package called `PowerMill'. The experimental results show that a maximum 21% average power and 27.7% peak power power reduction can be obtained with only 3% delay increase. Based on the experiments, we propose a general methodology for the practical application of the transformations for power optimization of CMOS logic circuits. Finally, a comparison between the experiments using random input waveforms and customer provided input waveforms is presented.
This paper presents a low-overhead controller-based power management technique that re-specifies control signals to reconfigure existing multiplexer networks and functional units to minimize unnecessary activity. We demonstrate that conventional power management techniques may often not be suited to control-flow intensive designs, and provide a comprehensive analysis of the potential negative effects of power management on circuit delay, glitching activity at control and data path signals, and formation of false combinational cycles. We present techniques to perform power management through controller re-specification while avoiding the above negative effects, and demonstrate the efficiency of the techniques through experiments.
This paper presents for the first time low energy simultaneous memory and register allocation. A minimum cost network flow approach is used to efficiently solve for minimum energy dissipation solutions in polynomial time. Results show that estimated energy improvements of 1.4 to 2.5 times over previous research are obtained. This research is important for industry since energy dissipation is minimized without requiring an increase in cost.
In this paper, a transformation technique, called power-conscious loop folding is proposed for high level synthesis of a low power system. Our work is focused on reducing the power consumed by functional units through the decrease of switching activity in a data path dominated circuit containing loops. The transformation algorithm has been implemented and integrated into a high level synthesis system for experiments. In our experiments, we could achieve power reduction of up to 50% for circuits dominated by functional units.
We present a subjective review of custom cell generation methods in the context of future advances in state-of-the-art digital circuit synthesis. In particular, we describe three opportunities for coupling circuit optimization operations with the library development process. These operations include electrical optimization, technology mapping, and cell level place and route.
We present a novel technique CLIP for optimizing both the height and width of CMOS cell layouts in the two-dimensional (2-D) style. CLIP is based on integer-linear programming (ILP) and proceeds in two stages: First, an ILP model is used to determine a 2-D layout of minimum width W cell . Then, another model generates a 2-D layout that has width W cell and requires a minimum number of routing tracks. Run times are in seconds for circuits with up to 16 transistors. For larger circuits, we extend CLIP to a hierarchical method HCLIP that places series-connected transis-tors contiguously. This reduces run times by up to three orders of magnitude, and still yields optimal results in over 80% of cases.
In timing-driven layout synthesis, transistor sizes tend to be significantly different from each other and thus the use of conventional layout approaches can cause inefficient area utilization. We propose an efficient algorithm to find the optimal transistor folding sizes in row-based designs. Our algorithm finds optimal folding sizes given a CMOS circuit with m pairs of pMOS and nMOS transistors in O(m^2 log m) time complexity with the effective reduction of the solution space. MCNC benchmark circuits are used to demonstrate the area-efficiency of the physical layouts with optimal folding sizes.
The ability to recognize polygon-based layout as a collection of objects representing circuit elements connected by path-based wires, enables existing designs implemented using an older fabrication process to be reimplanted quickly in a new process. The approach taken here, based on layout generator technology, is to create a collection of parameterized circuit objects that, with appropriate arguments, are able to represent the devices (e.g., transistors, contacts) implicitly described in the flattened design. The recognition engine is fully programmable, is independent of any particular technology or device set, and is not restricted to manhattan or even octilinear geometries. In this paper, we describe a novel three-phase approach to object recognition: device recognition, exact wire recognition, and wire synthesis. We believe exact wire recognition to be entirely new and cover it in detail. Experimental results demonstrate the effectiveness of the algorithms on actual layout.
In this paper, we present a test synthesis approach which integrates BALLAST (BALAnced structure Scan Test) with an enhanced test point insertion (TPI) algorithm to functionally scan the flip-flops chosen by BALLAST. BALLAST is an attractive partial scan technique in that it offers combinational ATPG efficiency while promising to reduce full scan overhead. However, the practical problem with BALLAST is it typically requires more scan flip-flops than other partial scan techniques. The TPI enhancements enable TPI to aim at the reduction of BALLAST overhead. The enhancements include a more exible test point insertion heuristic, a modified gain function which enables TPI to target a selected set of flip-flops, and a more efficient procedure to remove redundant test points. The experimental results on nine benchmark circuits show the proposed test synthesis approach can achieve on average 38% area saving compared to full scan, while BALLAST alone achieves 17%.
This paper presents a new scan-based BIST scheme which achieves very high fault coverage without the deficiencies of previously proposed schemes. This approach utilizes scan order and polarity in scan synthesis, effectively converting the scan chain into a ROM capable of storing some center patterns from which the other vectors are derived by randomly complementing some of their coordinates. Experimental results demonstrate that a very high fault coverage can be obtained without any modification of the mission logic, no test data to store and very simple BIST hardware which does not depend on the size of the circuit.
We propose a new algorithm for test point selection for scan-based BIST. The new algorithm combines the advantages of both explicit-testability-calculation and gradient techniques. The test point selection is guided by a cost function which is partially based on explicit testability recalculation and partially on gradients. With an event-driven mechanism, it can quickly identify a set of nodes whose testability need to be recalculated due to a test point, and then use gradients to estimate the impact of the rest of the circuit. In addition, by incorporating timing information into the cost function, timing penalty caused by test points can be easily avoided. We present the results to illustrate that high fault coverages for both area-and timing-driven test point insertions can be obtained with a small number of test points. The results also indicate a significant reduction of computational complexity while the qualities are similar to the explicitly-testability-calculation method.
This paper describes an automated design and synthesis methodology for telecommunications ASICs. Array frames, an array structured visualization of the processing problem, are used in the specification and debugging of the design. This allows design much closer to the specification level. The array frame concept is integrated into the Dali structured control logic design environment. An SDH (Synchronous Digital Hierarchy) style example demonstrates the use of array frames. The array frame approach was successfully applied in industrial applications.
This paper describes a top-down hierarchical simulation process that dramatically accelerates the design-evolution of DSP systems when compared with traditional physical prototyping methods. A summary of the model abstraction hierarchy and purposes for the various types of models provides critical guidance in selecting appropriate abstractions to achieve accurate yet efficient simulations. The simulations enable rapid exploration of many more alternative software and hardware solutions than would be practical using conventional gate-level simulation technologies. The result is improved design adaptations and quality.
A methodology for architecture exploration of look-up table based decoders is presented. For the degree of parallel processing a trade-off can be made by exploring system level and register transfer level models. Executable specifications (pure functional software models, VHDL behavior models) are used to analyze the performance of different architectures. Hardware cost (area) and feasibility (timing) are determined by synthesis of RTL models. These models are generated directly out of the specification to avoid errors due to manual transformations and to reduce overall design time. Generator-based reuse modeling and hardware cost estimation is demonstrated using a decoder for MPEG variable length codes (VLC).
Network flow approaches have been used for partitioning with success in the past. However, most of them can not deal with size constraints directly in partitioning. Instead, they incorporate the size constraints implicitly in the objective function. This paper presents a new network flow approach for partitioning circuits into tree hierarchies. We formulate a linear program for the hierarchical tree partitioning problem by spreading metrics proposed in [3][4]. The size constraints in partitioning can be formulated directly as linear constraints. Motivated by the duality between the linear programs for partitioning and network flow problems, we devise a heuristic algorithm based on network flow and spreading metric computations. Experimental results demonstrate that the new algorithm can generate better solutions for MCNC benchmarks.
In this paper, we present a new integrated synthesis and partitioning method for multiple-FPGA applications. This method first synthesizes a design specification in a fine-grained way so that functional clusters can be preserved based on the structural nature of the design specification. Then, it applies a hierarchical set-covering partitioning method to form the final FPGA partitions. Our approach bridges the gap between HDL synthesis and physical partitioning by fully exploiting the design hierarchy. Experimental results on a number of bench-marks and industrial designs demonstrate that I/O limits are the bottleneck for CLB utilization when applying a traditional multiple-FPGA synthesis method on attened netlists. In contrast, by fully exploiting the design structural hierarchy during the multiple-FPGA partitioning, our proposed method produces fewer FPGA partitions with higher CLB and lower I/O-pin utilizations.
This paper addresses an automatic partitioning method of a design into several FPGAs. Although the circuit partitioning methods have recently been significantly advanced, partitioning is commonly performed at the gate netlist level. To cope with large designs and explore the solution space efficiently, clustering of the logic is mandatory. In this paper, the hierarchy of the design, naturally introduced by the designer, guides the partitioning. The basic concepts are introduced in terms of "envelope" delimiting hierarchy blocks. These concepts lead to an "envelope"-based clustering and to the proposed final hierarchy-driven partitioning. Results are given on industrial examples on XILINX 4000 technology.
In this paper, we present a new hypergraph partitioning algorithm that is based on the multilevel paradigm. In the multilevel paradigm, a sequence of successively coarser hypergraphs is constructed. A bisection of the smallest hypergraph is computed and it is used to obtain a bisection of the original hypergraph by successively projecting and refining the bisection to the next level finer hypergraph. We evaluate the performance both in terms of the size of the hyper-edge cut on the bisection as well as run time on a number of VLSI circuits. Our experiments show that our multilevel hypergraph partitioning algorithm produces high quality partitioning in relatively small amount of time. The quality of the partitionings produced by our scheme are on the average 4% to 23% better than those produced by other state-of-the-art schemes. Furthermore, our partitioning algorithm is significantly faster, often requiring 4 to 15 times less time than that required by the other schemes. Our multilevel hypergraph partitioning algorithm scales very well for large hypergraphs. Hypergraphs with over 100,000 vertices can be bisected in a few minutes on todays workstations. Also, on the large hypergraphs, our scheme outperforms other schemes (in hyperedge cut) quite consistently with larger margins (9% to 30%).
Recent work [2] [5] [11] [12] [14] has illustrated the promise of multilevel approaches for partitioning large circuits. Multilevel par-titioning recursively clusters the instance until its size is smaller than a given threshold, then unclusters the instance while applying a partitioning refinement algorithm. Our multilevel partitioner uses a new technique to control the number of levels in the matching-based clustering phase and also exploits recent innovations in clas-sic iterative partitioning [7] [10]. Our heuristic outperforms numerous existing bipartitioning heuristics, with improvements ranging from 6.9 to 27.9% for 100 runs and 3.0 to 20.6% for just 10 runs (while also using less CPU time).
In this paper, we present design for testability (DFT) and hierarchical test generation techniques for facilitating the testing of application-specific programmable processors (ASPPs) and application-specific instruction processors (ASIPs). The method utilizes the register transfer level (RTL) circuit description of an ASPP or ASIP and tries to generate a set of test microcode patterns which can be written into the instruction read-only memory (ROM) of the processor. These lines of microcode dictate a new control/data flow in the circuit and can be used to test modules which are not easily testable. The new control/data flow is used to justify precomputed test sets of a module from the system primary inputs to the module inputs and propagate output responses from the module output to the system primary outputs. If the derived test microcode cannot test all untested modules in the circuit, then test multiplexers are added to the data path to test these modules and thus testability of all modules is guaranteed. This scheme has the advantages of low area and delay overheads (average of 3.1% and 0.4% respectively), high fault coverage (>99.6% for all cases) and at-speed testing. Test generation times are about three orders of magnitude smaller than an efficient gate-level sequential test generator. Only one external test pin is required for the DFT method.
We examine frequency-domain issues in the design and selection of on-chip test generators for built-in self-test (BIST) of high-performance digital filters. Test-generator/circuit compatibility is identified as a significant factor in testing large filters. A fault-injection experiment is used to show that when an incompatible test generator is used, high fault coverage (over 99%) does not guarantee that all serious faults will be detected. The frequency-domain characteristics of some basic test generation schemes are examined, and guidelines for test generator selection are proposed. Analytical techniques for identifying frequency-related testability problems are discussed, and several test generation schemes are evaluated by fault simulating them against lowpass, bandpass, and highpass fil-ters. A mixed test generation scheme is shown to reduce the number of untested faults by a factor of two to three over a standard linear-feedback shift-register (LFSR) based test scheme, at little added cost.
In systems consisting of interacting datapaths and controllers and utilizing built-in self test (BIST), the datapaths and controllers are traditionally tested separately by isolating each component from the environment of the system during test. This work facilitates the testing of datapath-controller pairs in an integrated fashion. The key to the approach is the addition of logic to the system that interacts with the existing controller to push the effects of controller faults into the data ow, so that they can be observed at the datapath registers rather than directly at the controller outputs. The result is to reduce the BIST overhead over what is needed if the datapath and controller are tested independently, and to allow a more complete test of the interface between datapath and controller. Fault coverage and overhead results are given for four example circuits.
This paper introduces a directed hypergraph model that supports (1) work flow composition and reconfiguration while accessing and executing programs, data, and computing resources across the Internet, (2) synchronous and asynchronous peer-to-peer interaction between members of any team during work flow composition and execution, (3) synchronous and asynchronous peer-to-work flow interaction between any team member and any object in the work flow. Given a library of program and data nodes, editing the work flow and its execution is as intuitive as the hierarchical schematic design capture and simulation. Examples of multi-site multi-user applications demonstrate that the proposed work flow implementation provides a user-friendly paradigm for distributed and collaborative team design.
A number of industry trends are shaping the requirements for IC and electronic
equipment design. The density and complexity of circuit technologies have
increased to a point where design cannot be performed without EDA tools. The
availability of completely designed and verified reusable design components has
become a major impediment to meeting required design productivity goals. Design
reuse is moving down the package hierarchy to include chip design in addition
to PCA design. At the same time, the widespread use of Internet and Worldwide
Web technologies offer component suppliers global access to customers, and
component users access to suppliers, worldwide. To realize the full potential
of modern technology, it is necessary that a level of consistency be defined
and used to represent component information. Standard formats for datasheets
and standard terminology within datasheets aids customers in their analysis and
value-comparison of components from multiple suppliers. Additionally, standards
are required to facilitate software tools than can help automate current manual
processes to populate customers Component Information System (CIS) databases
and generate CAD libraries from this component information. Electronic
Component Information Exchange (ECIX) is a multi-company project that is
focused on the development of standards and architecture that supports the
flow of component information from suppliers to all potential customers in a
manner that improves productivity for:
The important step towards a comprehensive CAD framework is the development of a suitable, complete design model on which the design system's components are based. To date, we generally find 'island' solutions for different aspects as data and process management, but in future, we need more and more integrated solutions. Only the integration gives us the traceability we need for design planning, to generate parts of the design tool's code automatically, etc. This paper describes how a suitable Design Task Model can be used to link the Product and Flow Models which are currently separated in most frameworks. Using the PLAYOUT Design Model as an example, we show how the Product, the Task, and the Flow Models may fit together, the development of consistent, integrated models.
This paper presents an effective technique for compacting a large sequence of input vectors into a much smaller one such that when the two sequences are applied to any circuit, the resulting power dissipations are nearly the same. Specifically, this paper introduces the hierarchical modeling of Markov chains as a flexible framework for capturing not only complex spatiotemporal correlations, but also dynamic changes in the characteristics of the input sequence. The new framework has a high degree of adaptability, i.e. the hierarchical model is dynamically grown according to the sequence behavior. Experimental results demonstrate that large compaction ratios can be obtained without a significant loss in accuracy (less than 5% on average) of power estimates.
This paper presents a new approach for estimating power dissipation in a high performance microprocessor chip. First, characteristic profile (including parameters such as the cache miss rate, branch prediction miss rate, pipeline stalls, instruction mix, memory references, etc.) is extracted from application programs. Then, mixed integer linear programming and heuristic rules are used to gradually transform a generic program template to into a fully functional program. The synthesized program exhibits the same performance and power dissipation behavior (as characterized by the extracted profile), yet it has an instruction trace orders of magnitude smaller than the initial trace. The synthesized program is subsequently simulated on a register-transfer level description of the target microprocessor to provide the power dissipation value . Results obtained for the Intel's Pentium processor executing standard benchmark programs show a simulation time reduction by 3-5 orders of magnitude.
Presented here is an analytical methodology to determine the average signal activity, T , from the high- level signal statistics, a statistical signal generation model, and the signal encoding. Simulation results for 16 bit signals generated via AR (1) and MA (1) models indicate an estimation error in T of less than 2%. The application of the proposed method to the estimation of T in DSP hardware is also explained.
Buffer insertion seeks to place buffers on the wires of a signal net to minimize delay. Van Ginneken [14] proposed an optimal dynamic programming solution (with extensions proposed by [7] [8] [9] [12]) such that at most one buffer can be placed on a single wire. This constraint can hurt solution quality, but it may be circumvented by dividing each wire into multiple smaller segments. This work studies the problem of finding the correct number of segments for each wire in the routing tree. Too few segments yields sub-par solutions, but too many segments can lead to excessive run times and memory loads. We derive new theoretical results for computing the appropriate number of buffers (and hence wire segments) which motivate our new wire segmenting algorithm. We show that using wire segmenting as a precursor to buffer insertion produces solutions within a few percent of optimal, while using only seconds of CPU time.
Academic clock routing research results has often had limited impact on industry practice, since such practical considerations as hierarchical buffering, rise-time and overshoot constraints, obstacle- and legal location-checking, varying layer parasitics and congestion, and even the underlying design flow are often ignored. This paper explores directions in which traditional formulations can be extended so that the resulting algorithms are more useful in production design environments. Specifically, the following issues are addressed: (i) clock routing for varying layer parasitics with nonzero via parasitics; (ii) obstacle-avoidance clock routing; (iii) a new topology design rule for prescribed-delay clock rout-ing; and (iv) predictive modeling of the clock routing itself. We develop new theoretical analyses and heuristics, and present experimental results that validate our new approaches.
In this paper, we present an efficient heuristic algorithm for the layer assignment and via minimization problem for multi-layer gridless IC, PCB, and MCM layout. We introduce the notion of the extended conflict-continuation (ECC) graph to represent the multi-layer layer assignment problem. Our algorithm is based on a linear time optimal algorithm that solves a special case of the layer assignment problem when the ECC graph is a tree. For the general layer assignment problem where the ECC graph is not a tree, our algorithm constructs a sequence of induced subtrees in the ECC graph and applies our linear time optimal algorithm to each of the induced subtrees. We have applied this algorithm to the via minimization problem and get very encouraging results. We have achieved 13%-16% via reduction on the routing layout generated by V4R router[13], which is a router known to have low usage of vias. We successfully applied our algorithm to routing examples of over 30,000 wire segments and over 40,000 vias.
In this paper, we consider non-uniform wire-sizing under the Elmore delay model. Given a wire segment of length C, let f(x) be the width of the wire at position x, 0 < x <= L. It was shown in [2, 5] that the optimal wire-sizing function which minimizes delay is an exponential tapering function f(x) = ae^(-bx), where a > 0 and b > 0 are constants. Unfortunately, [2,5] did not consider fringing capacitance in deep submicron designs. As a result, exponential tapering is no longer the optimal strategy. In this paper, we show that the optimal wire-sizing function, taking fringing capacitance into consideration, is f(x) = -cf/2c0 ( 1/(w(-cf/(ae^(-bx))))+1) where W(x) = Σ (n=1 to infinity) ((-n)^(n-1))/n! x(n) is the Lambert's W function, cf and co are the respective fringing capacitance and area capacitance of wire per unit square, a > 0 and b > 0 are constants. he optimal wire-sizing function degenerates into an exponential tapering function as cf = 0, and degenerates into a square-root tapering function (f(x) = sqrt(b-ax), where a > 0 and b > 0) as cf -> infinity. Our experimental results show that the optimal wire-sizing function can significantly reduce the interconnection delay of exponentially tapered wires. In the case where lower and upper bounds on the wire widths are given, the optimal wire-sizing function is a truncated version of the above function. Finally, our optimal wire-sizing function can be iteratively applied to optimal size all the wire segments in a routing tree for objectives such as minimizing weighted sink delay, minimizing maximum sink delay, or minimizing area subject to delay bounds at the sinks.
We present an improved procedure for fault simulation under the multiple observation time approach based on state expansi