Intel's latest Lunar Lake designs extended the width over previous cores while also clocking high (and the same will likely be true of the upcoming P-cores). There were studies in the past that examined the diminishing returns of 'wideness', but I can't think of what to search for to find the main one I have in mind right now.
I remember reading a paper from a millennium long gone, where the authors simulated a "perfect" CPU. It was infinitely wide (i.e. an unlimited number of execution units for each type of instruction), branch prediction always guessed right, all instructions had single-cycle latency (implying that all memory accesses were cache hits), and so on. Their perfect simulated CPU was limited only by causality; i.e. it could not magically guess values before it had actually computed them.
On this simulated processor they ran a workload that was both important and difficult (for real CPUs) to execute quickly: a compiler. Over the complete compiler run, the "perfect" CPU reached an average IPC (instructions per clock) of around 2000.
The researchers' next step was to introduce limits into their simulation. The less perfect (but still far beyond realistically feasible) simulated CPU was 2000 wide: up to 2000 instructions could be executed in any single clock cycle, but no more. The rationale was that the first experiment suggested this should be enough, on average, to run the compiler near the "causal limit".
So they made a run with the 2000-wide simulated CPU and got an effective IPC of ...
drumroll ... eight. Just eight instructions per clock cycle, executed on average over the whole compiler run.
On closer inspection, the researchers found that the "perfect" CPU got most of its speed from its ability to look arbitrarily far into the future of the running program. It found independent work even across compiler phases, which allowed it to be extremely bursty: an individual clock cycle could execute millions of instructions, making up in advance for the many mostly-stalled cycles that followed.
The 2000-wide CPU could not come anywhere near such burst benefits.
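To make the mechanism concrete, here is a minimal sketch in C (my own toy, not the paper's simulator): it schedules a made-up dependence trace under the same idealized assumptions (unit latency, no mispredictions, no cache misses), once with unlimited issue width and once with a small cap. The trace and the cap of 2 are invented purely for illustration.

```c
/* Toy limit study: schedule a small dependence trace with unit latency.
 * The only constraints are data dependences plus an optional issue-width
 * cap. Trace and widths are made up for illustration only.              */
#include <stdio.h>

#define N 12
#define NONE -1

/* producer[i] = index of the instruction whose result instruction i needs,
 * or NONE. Here: a short dependent chain (0 -> 1 -> 2) plus nine
 * independent instructions, so the "perfect" schedule is very bursty.    */
static const int producer[N] = {NONE, 0, 1, NONE, NONE, NONE,
                                NONE, NONE, NONE, NONE, NONE, NONE};

/* Cycles needed to retire the trace when at most `width` instructions may
 * issue per cycle; width <= 0 means unlimited (the "perfect" CPU).       */
static int schedule(int width)
{
    int issue_cycle[N];
    int issued[2 * N + 2] = {0};    /* instructions issued per cycle      */
    int last = 1;

    for (int i = 0; i < N; i++) {
        int c = 1;                                  /* data-ready cycle   */
        if (producer[i] != NONE)
            c = issue_cycle[producer[i]] + 1;       /* unit latency       */
        if (width > 0)
            while (issued[c] >= width) c++;         /* wait for free slot */
        issued[c]++;
        issue_cycle[i] = c;
        if (c > last) last = c;
    }
    return last;
}

int main(void)
{
    int c_perfect = schedule(0);    /* unlimited width                    */
    int c_narrow  = schedule(2);    /* width capped at 2                  */
    printf("unlimited width: %d cycles, IPC %.2f\n",
           c_perfect, (double)N / c_perfect);
    printf("width 2:         %d cycles, IPC %.2f\n",
           c_narrow, (double)N / c_narrow);
    return 0;
}
```

On this toy trace the unlimited machine finishes in 3 cycles, almost all of it in one big burst, while the 2-wide machine needs 6; that is the paper's 2000-wide result in miniature.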
(I have tried, unsuccessfully, to find this paper at least four times since I first read it. Sigh.)
BTW, reality has since surpassed even the perfect simulated CPU a little bit. Nowadays we do things like instruction fusion, where (causally) dependent instructions are executed not in subsequent clock cycles but in a single one; that is actually one clock cycle faster than the perfect CPU above. And our CPUs often have fairly powerful SIMD execution units, which sometimes deliver the performance of a CPU much wider than the one we actually have.
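The SIMD point is easy to see in source form. The loop below is a hedged, generic example (nothing Lunar-Lake-specific): built scalar it is one 32-bit addition per instruction, but a compiler with AVX2 enabled (e.g. gcc or clang with -O3 -mavx2) will typically turn it into 256-bit vector adds, each carrying eight of those additions, so for this kind of regular code the core behaves as if it were several times wider than its nominal issue width.

```c
#include <stddef.h>

/* Independent iterations: compilers commonly auto-vectorize this, so one
 * vector instruction carries 8 (AVX2) or 16 (AVX-512) of these adds.    */
void add_arrays(int *restrict dst, const int *restrict a,
                const int *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```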
In practice, any complicated workload that reaches even an average IPC of 1.0 on a real, 8-wide machine is already fairly rare. Some optimized workloads or very regular algorithms can break an IPC of 2.0 on a real CPU core, but that almost always involves a lot of brain cycles and a lot of work to get there.
The vast majority of program code has never been tuned to that point, and average IPCs below 1.0 are commonplace.
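To illustrate the two extremes with a made-up micro-example (the function names and the choice of four accumulators are mine, not from any benchmark): the first function below is one long dependence chain where every load waits on the previous one, which is the shape of a lot of ordinary pointer-heavy code and typically lands well below an IPC of 1; the second spreads the work over independent accumulators so a wide out-of-order core can keep several additions in flight per cycle. Running each under something like perf stat (which reports instructions and cycles) will usually show the gap, though the exact figures depend on the machine and compiler.

```c
#include <stddef.h>

struct node { struct node *next; long value; };

/* One long dependence chain: every load needs the previous load's result,
 * so the core mostly waits on latency (and cache misses, on big lists).  */
long chase(const struct node *p)
{
    long sum = 0;
    while (p) {
        sum += p->value;
        p = p->next;          /* the next load depends on this one */
    }
    return sum;
}

/* Independent work the scheduler can overlap: four accumulators break the
 * dependence chain, so several additions can be in flight per cycle.     */
long sum4(const long *a, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)        /* leftover elements */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}
```

Four accumulators is an arbitrary choice here; the point is only that breaking the dependence chain is what lets the extra width do anything at all.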