The Pentium family breaks operations up into several micro-ops (i.e. load/store/ALU etc.), whereas AMD uses a 'reduced' microcode set. The microcode is (as far as I know) hardwired in the die.
The real differences show up when one processor needs to execute 4 micro-ops where another needs only 2...
The Pentium M is based more on the P3 design than on the P4 in this respect, as it uses bigger hardwired micro-ops.
The P1 wasn't really 1 op/1 cycle. An op usually took 6 cycles, and it was said to execute in one cycle only because it was sent through the pipeline:
Code:
xxxxx1
xxxxx1
xxxxx1
12345|||
So, once the operation had entered the pipe, and as long as there was no AGI stall, no read-after-write dependency and no other hazard (e.g. a failed U/V pairing), the op was executed in '1 cycle'.
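To put rough numbers on the "6 cycles of latency, but 1 cycle per op" idea, here is a minimal sketch in Python. It assumes an idealized single-issue pipeline (it ignores U/V dual issue), takes the 6-cycle depth from the figure quoted above, and treats stalls as a simple count of bubble cycles. The function names are made up for illustration only.

```python
def total_cycles(num_ops: int, pipeline_depth: int, stall_cycles: int = 0) -> int:
    """Cycles needed to finish num_ops on an idealized single-issue pipeline."""
    if num_ops == 0:
        return 0
    # The first op needs pipeline_depth cycles to come out of the pipe.
    # Every later op completes one cycle after the previous one,
    # and each stall (AGI, dependency, ...) adds one bubble cycle.
    return pipeline_depth + (num_ops - 1) + stall_cycles


def cycles_per_op(num_ops: int, pipeline_depth: int, stall_cycles: int = 0) -> float:
    """Average cycles per op: tends towards 1 as the pipe stays full."""
    return total_cycles(num_ops, pipeline_depth, stall_cycles) / num_ops


if __name__ == "__main__":
    depth = 6  # assumed depth, taken from the "6 cycles" figure above
    print(total_cycles(1, depth))          # a single op still takes 6 cycles
    print(cycles_per_op(1000, depth))      # a full pipe averages about 1 cycle/op
    print(cycles_per_op(1000, depth, stall_cycles=100))  # stalls push the average up
```

With the pipe kept full, 1000 ops take 1005 cycles in this model, which is where the '1 cycle' figure comes from; every stall adds directly to that total.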
AMD is faster than the Px family for this reason, but it is more sensitive to first-level cache misses because of its shorter pipeline.
I noticed that the PM also uses a larger L1 cache for this reason, so I can only (wildly) guess that the time lost accessing L2 is much bigger (in %) with a shorter pipe than with a longer one.
Does anyone know more on the subject?