Jun 10 2010

I was interviewed recently by EEMBC, the embedded computing industry’s foremost benchmarking consortium.  You can read that interview here.

  5 Responses to “EEMBC interviewed me about CoreMark”

  1. The memory system bus speed may not be important to CoreMark’s code once loaded into a core’s cache, but communication between multiple cores still operates on that bus, doesn’t it? Wouldn’t that be a factor? Or are shared, on-die cores operating at non-memory-bus speeds?

  2. Hi, Rick. If it’s an SMP system, then the cores communicate and coordinate over the bus, but multi-core chips do most of this through their integrated caches which usually operate at core speed.

  3. Van, that’s interesting. I’ve always thought that same-die multi-core CPUs still communicated over the system bus (using the MOESI protocol); they just did it on-die, with the cache being its own component that’s sort of “in the loop,” as it were, sending data to the cores as it’s requested and present. I didn’t realize there were different integrated buses internally and externally.

    How does that work with power planes and variable-clocked cores? The bus between cores must be maintained at a constant voltage and clock speed then, separate from the thing it’s communicating with, and regardless of the other core’s compute clock speed, right? Or in the alternative, when a particular core goes high / low voltage, there must be circuitry at every other core which auto-adjusts up/down to transmit to the other cores?

    • Well, you’re right for some of the very primitive “dual-core” implementations like the P4-D and the “dual-core” Atoms. In those cases, you really have an SMP system on a package where two discrete dice are tied together over a shared bus. The 65-nm “dual-core” VIA Nano (CNB) prototype at the last Computex was implemented similarly, but it apparently will not be released. Instead, Centaur plans a true dual-core design for the 45-nm shrink. I believe that design will be very similar to a Core2 Duo with a large, shared L2. The last I heard, Glenn, Al and Rodney were targeting C2D-level IPC as well, and they were confident that they could reach it.

      I’m not an expert on maintaining cache-coherency in true multi-core environments, particularly with integrated memory controllers like you see with AMD, ARM and newer Intel processors — in those cases, there is no external FSB. I believe they all communicate with the outside world using high-speed serial buses. I think Centaur planned to use PCIe for this.

      Power management with modern CPUs has become extremely complicated. My experience has been with single-core chips. My assumption has been that multicore chips need separate power planes if each core is going to be turned off individually. Regarding P-States, I’m sure each core has its own PLL triggered by the same master B-Clock and the only way for voltages to be different for each core is if they each had their own power plane.

      One issue that has been problematic over the years is the timestamp counter (TSC). Microsoft issued a recommendation many years ago that the TSC should be incremented at a constant rate regardless of the true CPU frequency; this would make it easy to keep the TSCs in sync between cores. However, it hasn’t worked that way until recently and it was a headache for benchmark developers who use the TSC as a high resolution timer.

      • Van, that definitely makes sense. I remember reading about the RDTSC issue as well. I also remember Intel working on their 40-core Terascale 2 CPU, which contained an undisclosed x86 core (though I believe it was based on the modern Pentium-like in-order version seen in Larrabee) and had a 6-way proprietary bus architecture for massive data migration per core per clock.

        Terascale 2’s cores were arranged on the die in a tile pattern, with each core having four links for north/south/east/west communication, plus one link dedicated to traffic to and from main memory and one dedicated to traffic to and from non-adjacent cores. The total aggregate bandwidth on that CPU was theoretically unreal (I don’t remember the exact figures, but I believe around 2.56 terabits/s (320 GB/s) on around 100 watts).

        When I interviewed Jerry Bautista in 2007, he was extremely excited about Terascale 2. I even managed to arrange a meeting between him and another person I had interviewed previously, Dr. Ioannis A. Kakadiaris of the Texas Learning & Computation Center at the University of Houston. Dr. Kakadiaris hoped to use the amazing compute potential of Terascale 2 for 3D medical scanning systems he was developing, which produced such a volume of data that it was not feasible to process in real time. He was hoping to develop “agent algorithms” that would sift through the 3D medical scans and look for problems automatically, without a doctor having to eyeball the scans to find tumors and the like. It was a fascinating medical technology, and an equally fascinating application of the hardware to the problem.
