Theorem proving techniques are particularly well suited for reasoning about arithmetic above the bit level and for relating different levels of abstraction. In this paper we show how a non-restoring integer square root algorithm can be transformed to a very efficient hardware implementation. The top level is a Standard ML function that operates on unbounded integers. The bottom level is a structural description of the hardware consisting of an adder/subtracter, simple combinational logic and some registers. Looking at the hardware, it is not at all obvious what function the circuit implements. At the top level, we prove that the algorithm correctly implements the square root function. We then show a series of optimizing transformations that refine the top level algorithm into the hardware implementation. Each transformation can be verified, and in places the transformations are motivated by knowledge about the operands that we can guarantee through verification. By decomposing the verification effort into these transformations, we can show that the hardware design implements a square root. We have implemented the algorithm in hardware both as an Altera programmable device and in full-custom CMOS.