Is deformation imaging by ultrasound useful?

- Clinical use and scientific evidence.


The page is part of the website on Strain rate imaging
Myocardial deformation imaging by ultrasound / echocardiography
Tissue Doppler and Speckle Tracking

by Asbjørn Støylen, dr. med.

Contact address: asbjorn.stoylen@ntnu.no



This section updated: June 2015.


There is no doubt that deformation imaging has led to a deeper understanding of both the physiology and geometry of myocardial mechanics. However, the clinical utility relies on documentation in clinical studies. But in clinical echocardiography, an examination always uses all the available data, weighing them against each other and integrating the information. The conclusion is usually fairly certain, even if single measurements are not. Thus, clinical ultrasound is like clinical examination - in fact it is an extended clinical examination - meaning it will be a craft, not pure science, and documentation in studies will be limited.

Remember - there is no, or very little, evidence base for clinical examination.

Therefore, the next section deals with how to use the methods, and the last is a discussion of the clinical evidence from studies.











Other sections:

Measurements of strain and strain rate by ultrasound

Basic principles of ultrasound and scanner technology.

Mathematics of strain and strain rate

Back to main website index


What's the difference between scientific evidence and clinical practice of echocardiography?

Basically, clinical practice aims to be evidence based.


The guidelines for echocardiography generally lack both strength of recommendations and level of evidence assessment. In general, there is a lack of systematic procedures for evaluating the different diagnostic modalities. This is especially relevant in the clinical practice of echocardiography, where an integrated evaluation of function, based on a number of measures as well as visual assessment, is the standard. In echo, most clinical studies are fairly small. This is partly an economic issue, as they are driven not by industry but by academic institutions, and partly a practical one, as most studies need fairly many measurements, while the outcomes studied are most often only single measurements or indices.


The careful weighing of the evidence in terms of the methods' limitations is thus an integral part of the echocardiographic examination. Thus, background knowledge improves the clinical judgment. But using more of the available information (like strain rate) improves the diagnostic performance, and in turn the clinical judgment.

This approach is for practical reasons nearly impossible to test in a quantitative study, and approaches the more qualitative research use of background knowledge. With the limitations inherent in basic ultrasound and in the specific methods, clinical ultrasound will partly be a craft, not pure science, and knowledge of the methods themselves and the method specific limitations is essential. In many ways, clinical echocardiography can be considered to be exactly that, an extended clinical examination, more than a specific imaging modality. It may be important to remember that clinical examination and medical history, the two main cornerstones of medicine, have no scientific evidence whatsoever, despite being the main criteria for selecting patients for any diagnostic procedure.

The conclusion will usually be fairly certain in the hands of an experienced clinician, even if single measurements are not. This is a fundamental property of all echocardiography. Even if the evidence base of many of the measures is shaky, the clinical practice of echocardiography is still evidence based: the practice of evidence based medicine is taking the best evidence there is, and applying clinical judgment based on experience.

Deformation imaging, however, is no worse than other echo modalities once a critical view of the limitations is applied, but it should be used as part of the whole echo examination. In clinical use, the main point is whether it gives added diagnostic value compared to basic echocardiography.



Feasibility

With a liberal attitude to excluding segments, the question becomes one of the feasibility of analysis; accuracy matters little if feasibility is too low. We did a feasibility study (115) at our department, showing that there were reverberations in more than 80% of the patients. With the high number of reverberation artifacts reported, it may well be that the real feasibility is around 80% of segments, and that studies reporting more than 90% by manual analysis may in fact have a high number of artifacts included in the data.




In automated analysis

one uses manual or automatic placement of anatomical landmarks, such as the mitral plane and the apex, or draws a curve along the myocardium. The walls are then automatically segmented, and strain rate calculated according to the application used. This is implemented in the segmental strain application of NTNU, as well as in the various commercial speckle tracking methods. This automatic segmentation will in itself give better repeatability than manual placement of the ROI, without recourse to smoothing. By this method, segments are discarded due to poor tracking, poor alignment, visible presence of reverberations, as well as curve quality. Automated post processing will give segmental values once the region of interest is defined, eliminating the search for suitable curves, thus resulting in more objective traces. Discarding segments with poor data quality remains the only option in manual evaluation of results.

In a pilot study (127), the feasibility was between 75 and 80% of segments using the automated segmentation, both with the velocity gradient (in mid segment) and with segmental strain, while a little over 90% were analysable manually. This last number may be an overestimation, and an example of biased post processing in this study. The higher feasibility of manual analysis results from the possibility of reducing ROI and strain length, trying to get useful information out of small areas between artifacts. At that time, we were more prone to accept all curves that seemed reasonable. In addition, the numbers were too small to demonstrate differences in diagnostic accuracy.

In stress echo, a study using the combined application of NTNU, with both TDI and speckle tracking, did show that while WMS had a feasibility of 99% at baseline and 98% at peak stress, segmental strain had a feasibility of 86 and 79% for strain rate and strain, respectively, at baseline, and 84 and 77% at peak stress. The velocity gradient method had the lowest feasibility, with 80% for strain rate and 65% for strain, both at baseline and peak stress. However, even with lower feasibility than WMS, the diagnostic accuracy was shown to be higher, confirming independent diagnostic value. This was also confirmed by another study (133), showing added prognostic value of deformation measures over WMS, given a feasibility at peak stress of 93% for strain rate and 87% for strain by the segmental method.

In a method study, the feasibility of segmental strain and 2D strain was between 70 and 80% of segments (151). In the HUNT study (153), on the other hand, also using the combined segmental strain, the feasibility was lower, about 60% for both strain and strain rate at rest. It should be emphasized that although this is partially due to the limitations of segmental strain (which renders both the segments above and below the kernel infeasible for analysis), the main emphasis in this study was to provide normal data, and to make the measurements as clean as possible. Due to the large number of subjects, the possibility to exclude liberally, ensuring that the results were free from bias due to artefacts, was present. Basically, the low feasibility here is a characteristic of the study, not the method. This is also the main strength of the study, not a weakness as some mistakenly maintain when they confuse study and method. That this policy did succeed is shown by the values being normally distributed, as opposed to findings in some other population studies.

Unreasonable results, even where the curves look plausible, should still be carefully assessed before being accepted. This is the tricky point, as this may introduce bias. In this case, the parametric images should be assessed as well, to see whether the finding shows up in the semi quantitative plot. Any completely unreasonable curves should be discarded outright, even if the segments were accepted in the first place. The presence of reverberations does not affect the analysis of timing in the strain rate CAMM; moreover, reverberations can usually be recognised there, so the strain rate analysis is still not entirely without information.

In areas with poor data, the traces will depend on the position of the ROI. Both apparent pathology in normal areas, as seen here and in the limitations paragraph, and apparent normal function in pathological areas may be produced. Trying to eke out meaningful curves from poor quality data, and accepting those you find reasonable, will result in biased post processing.


In addition, the specific limitations of each method have to be considered.
Thus, emphasis should be on:
    - Excluding areas with reverberations (segments on both sides), irrespective of whether the values look meaningful.
    - Excluding the apex in foreshortened images; also, other areas of high curvature should be viewed critically.


However, this is not specific to deformation imaging alone. During a comprehensive echo examination, the integrated approach always includes weighing the single findings against the totality, resulting in changing the interpretation or discarding some measures as being inconsistent with the whole.

Evidence based Strain Rate Imaging


In order to discuss the evidence base for echocardiography, some of the basic principles of evidence based medicine will be summarised first, for comparison's sake. This is basically an introduction to the field of evidence based medicine, and a brief discussion of its application to imaging in general, and to echocardiography. The main purpose is to use the concepts to discuss the evidence base for strain rate imaging (although I haven't come too far on that, yet).

Some or all of this may be painfully obvious or even banal to some. If so, parts, or all of it, can be bypassed by the following links.
This part is all new. I intend to revise the part about evidence for strain rate imaging, but haven't come so far yet, so the part below the general considerations on echocardiography is all old, meaning that some new references are lacking, and what is there is not moulded into the new structure of this section.


Evidence based medicine

As defined by Sackett (307)

"Evidence based medicine is the conscientious, explicit, and judicious use of current best evidence in making decisions about the care of individual patients. The practice of evidence based medicine means integrating individual clinical expertise with the best available external clinical evidence from systematic research. By individual clinical expertise we mean the proficiency and judgment that individual clinicians acquire through clinical experience and clinical practice."

As scientific evidence is statistical in nature, in dealing with the individual patient, clinical judgment has to be applied as well in making decisions.
Thus, the main point is taking the evidence there is, and using it in concert with clinical judgment based on experience and basic knowledge.


It does not mean


In evidence based medicine, the patient's expectations, values and context will also be a part of the decision process, as patient compliance is an important contributor to the efficacy of any treatment. However, this also presupposes patient information and education, of course (323). This concept is furthest developed in selecting treatment interventions.

Evidence based treatment


Basically, this means:
Clinical judgment is based on clinical experience and general knowledge. Clinical experience represents considerable knowledge, which is important in forming a diagnostic or therapeutic judgment, especially in light of the limitations of all kinds of clinical studies. However, clinical experience may tend to be biased, as positive experience tends to be reinforced, while negative experience may be disregarded. Therefore, clinical experience is not considered scientific evidence, although it is still part of the basis for evidence based medicine.

Patient expectations, knowledge, values and context (e.g. economic or geographical) are a prerequisite for the success of any treatment. The best medical treatment is useless if the patient doesn't take the medicine. Surgery will have much less effect unless followed by patient compliance with a rehabilitation regimen, etc. However, this also presupposes that the patient is educated about the problem, so the expectations are founded in a realistic knowledge of what can be expected.

Scientific evidence

Scientific evidence about treatment is generally considered in different levels:

Individual clinical judgment is still a part of the practice of evidence based medicine, and this is based on general knowledge and experience. External scientific evidence needs to be evaluated, both for relevance and quality. Finally, the patient's expectations should be taken into account, as treatment compliance is necessary to achieve the treatment goals.

Simplified evidence pyramid, showing the rating of evidence from bottom to top. As more RCTs appear, the possibility of systematic reviews supersedes the evidence from single RCTs. However, as seen from the definition above, the practice of evidence based medicine returns to the background knowledge as the foundation of clinical practice.


The evidence pyramid varies slightly in different publications. In clinical guidelines, a simplified series of three levels is often used.

To those considering that only RCTs constitute definite evidence, however, it is important to remember that there are no RCTs demonstrating that smoking is harmful, only experimental evidence in animals and observational studies. Still, no one (at least outside the tobacco industry), would deny that the evidence is overwhelming.

In general, the first randomised studies are also small, often studying surrogate end points (like cholesterol, blood pressure or ejection fraction).
Demonstration of an effect on a surrogate end point does not necessarily mean that there will be an effect on a true end point.

This is illustrated, for instance, by the fact that while the association between high cholesterol and cardiovascular mortality risk was known (319), intervention with gemfibrozil did not reduce mortality (320), despite being very effective in reducing the surrogate end point (total and LDL cholesterol). There was reduced cardiovascular mortality, however, but this was offset by an increased non-cardiovascular mortality in the treatment group. Whether this reflects no effect at all, and simply a post hoc statistical redistribution, or a real increase in gemfibrozil side effects that offset the positive cardiovascular risk reduction, remains unclear. However, more recent studies of statins seem to indicate an all-cause mortality reduction as well as a reduction in cardiovascular deaths and events (321); this would suggest side effects of gemfibrozil, although it might also mean that the statins have other effects besides lipid lowering (for instance on endothelial function), meaning that the LDL lowering might not be the primary mechanism.

The absurdity of arguing from surrogate end points, however, is clearer if one suggests using hyaluronate injections in the ear lobe to abolish the ear lobe crease, or liposuction to reduce waist circumference: even though both are coronary risk factors with some evidence, there is no credibility to the hypothesis that those interventions will work against risk, and they are not even worth testing in a study.

True end point studies (true end points being survival, clinical events - if that is the treatment goal, or symptoms and QoL if symptomatic treatment is the goal), in general are large scale studies, to achieve sufficient study strength.


RCTs with clinical end points often need to be large, in the order of thousands of patient years, which is very costly. This is usually available in drug treatment, being financed by the revenue from drug sales during the patent period.

In drug treatment, the trials are often classified in phases (I - IV).
Thus, to reach the top level of evidence, phase III trials must be done.

Randomised clinical trials

have several strengths:

In order to test true end points (like survival and events, but also quality of life and symptoms, for symptomatic treatment) instead of surrogate end points, they have to be fairly large. In addition, the blinded design and structured follow up require an extremely costly organisation.

For economic reasons, large scale RCTs are extremely rare in non-patentable interventions like exercise training or surgery. Meta analysis of several smaller studies (often using surrogate primary end points) may achieve similar strength, and is considered an acceptable alternative, but is hampered by publication bias, which eliminates non-published negative studies from the analyses. Clinical trial databases, giving an overview of all studies being performed, may correct this.

However, blinding will only partly be possible in f.i. surgery or exercise training; although evaluation of outcome may be performed by a blinded observer, the blinding of patients is in general not possible, and neither is blinding of those actually performing the treatment (f.i. surgeons or exercise therapists), thus they may impart a placebo effect.

Limitations of RCTs

Thus the generalisability, and hence the transferability, from RCTs is limited, and must always be considered in the light of other evidence. Instead of being the top of the evidence pyramid, they should possibly be considered the tip of the iceberg:




 
Considering the limitations of RCTs, they may be considered only the tip of the iceberg. Both background knowledge and clinical experience form the clinical judgment for evidence based practice, and registry studies are the only ones addressing real life problems in long term perspectives, although with other limitations.
There is more to evidence than meets the eye (in fact at least 80%). Tip of an iceberg, but still over 100 m above the sea. Ilulissat ice fjord, Greenland.

As most of the iceberg is below water, most of medical knowledge is outside the scope of clinical trials, maybe as much as 80%.


This is why clinical guidelines recognise the level of evidence C, being limited to small studies, registry studies, or even to expert consensus (based on considerable general knowledge of physiology, general and specific disease knowledge etc.).

Registry studies

have certain advantages over RCTs, in that they:
On the other hand, they are observational studies:
Thus, registry studies are in general considered to be observational studies, and thus fairly low on the evidence pyramid; on the other hand, they are the studies that provide phase IV information, which is the highest level of information.

Clinical guidelines

To help in evaluating evidence, clinical guidelines usually contain treatment recommendations (309, 310):
Level of evidence is also classified:
which is an even more simplified evidence pyramid. This means that for evidence based treatment, phase III studies may be both level A and B. This classification does not take into account the impact of registry studies, which are phase IV studies.

This means that instead of evaluating the scientific evidence each time, the guidelines offer the support of having evaluated the evidence, and in EBM they may be used as the evidence part. But this still means that they should be used as exactly that: guidelines, and not as a cookbook for treatment decisions. That would not be real evidence based medicine, leaving out the clinical judgment and the patient's preferences. Guidelines only address one problem at a time, meaning that they are made for isolated diagnoses, and to a very small degree address, for instance, comorbidity. Thus, the guidelines are generally not the only decisive factor; in the individual case, clinical judgment has to be applied. And finally, the patient's expectations and preferences have to be taken into account, but for this, the patient also has to be educated about the problem.


In general, the concept of evidence based medicine is also not without problems (312), mostly due to the limited quality and scope of the evidence as discussed above, but also due to the potential for abuse, as the adherence to guidelines may be too rigid (although that is not what is intended by the concept).

Finally, the evidence often relies heavily on diagnostics which in themselves may have fairly limited evidence.

Evidence based diagnostics

In diagnostics, however, the basis for evaluating the evidence, as well as implementing a framework for evidence based diagnostics, has come somewhat shorter (308).

For diagnostics, guidelines should be about applying the tests, based on the evidence about the usefulness of the test.
Another kind of guideline is the how-to guideline, especially in the field of echocardiography, but these are, to a very large degree, more experience based than evidence based.



Clinical treatment guidelines often contain recommendations for applying diagnostic tests. However, they are very often further removed from clinical practice, as most guidelines are about established specific diseases or groups of diseases (like heart failure or valvular heart disease), and not about applying the test for diagnosis in a less selected population where disease is suspected. Thus, proper guidelines for diagnostics should take a clinical issue as the starting point, and evaluate the diagnostic modalities in relation to that (317, 318).

Thus the relevance of most treatment guidelines is lower for diagnostics than for treatment.


It's interesting that f.i. clinical management guidelines for heart failure (309, 310) contain class I recommendations for treatment (based on the EF being below 40%), while evidence for echocardiography or other imaging (which is necessary to measure that EF) is only considered at level C (although the evaluation of evidence for BNP is wildly diverging). This, of course, is methodologically problematic: in terms of logic, it seems irrational that a treatment based on a certain diagnostic method has higher evidence than its premises.

The evaluation of imaging at level C, however, reflects the fact that research in diagnostics in general consists of smaller studies. The design of studies needs to be different, as the end points are different, but the end points again need to be defined at each stage. Much of the information about diagnostic methods is at this level.

At the outset, it may be important to remember that clinical examination and medical history, the two main cornerstones of medicine, have no scientific evidence whatsoever. Still, the selection of patients by clinical means remains the main method of selecting patients for any diagnostic procedure, both in studies (phase III) and in daily practice, especially in relation to selecting a test based on positive and negative predictive value.

Diagnostics consist of:

To understand the evidence base of diagnostic tests, some terms will be discussed:

Validity and reliability:

Validity of a measurement means to what degree it measures what it is meant to (or purports to) measure.





Target shooting with two different weapons. The weapon on the left shows high reliability, as the shots are well gathered. However, the whole group, and hence the average, is off centre (the centre representing the ground truth); thus the method has less validity. The weapon on the right shows better validity, as the average of the shots is on centre, but the shots are less well gathered (more scattered); the weapon will tend to hit a different location each time, so it is less reliable, and the placement of the shots is more variable, or less reproducible. Accuracy (measurement accuracy, not to be confused with diagnostic accuracy) and precision of measurements are the numerical expressions of validity and reliability. Looking at the variability to the left, it is evident that the variation in intra-observer, same-cycle measurements is due to variability in the measurement method itself. Variability between measurements in different acquisitions (as in different cycles, or even in acquisitions on different days) may contain some measure of biological variability as well.

Validation studies carried out in simulations, mechanical models (3) or animal experiments usually represent true validation, and are usually carried out over a range of values or conditions (in animals). One example is strain and strain rate measured in animal models, compared to microcrystals (8). However, they usually do not tell how a method works in patients, where there may be considerable differences (for instance the difference between closed and open chest).


There are some problems with the concept of validity in medical diagnostics:

Reliability, on the other hand, is the consistency of a measure, meaning that the measurement will give similar results when measured under the same conditions. Variability is then the opposite of reliability/precision.


Comparison of three different ultrasound methods for deformation imaging, against tagged MR as reference, from (151); left 2D strain, middle segmental strain by combined tissue Doppler and speckle tracking, and right strain by the dynamic velocity gradient. Top row: Bland-Altman plots; bottom row: scatterplots with the identity line shown. It can be seen that there is a significant bias between 2D strain and MR, while the measurements are fairly well gathered together, but on average below the identity line. Thus, 2D strain seems to have the best reliability, although this is due to smoothing. The segmental method has a small bias, but this is not significant; the method is less reliable, as seen by greater scatter (and lower correlation). The velocity gradient method (right) shows no bias, i.e. good validity, but even greater scatter, and is clearly the least reliable.

Comparison of variability in longitudinal strain measurements by MR, 2D strain, segmental strain from speckle tracking (ST-7P) and from the combined method (TDI&ST), and finally from the longitudinal velocity gradient. After (151) (the original publication has an error in the category labels). We see that some of the variability shown above is due to variability in MR. 2D strain has the lowest variability, but this is due to smoothing.



The reliability or precision of a method is addressed by repeated measurements or assessments. This can be done by repeating the measurement in various ways.
The intra observer repeatability in the same recording will be highest, reflecting the inherent limits of the method's precision, i.e. the best precision that can be expected.
The inter observer variability in the same recording will be higher, reflecting variations in how measurements are done, for instance for strain rate. The degree of automation will tend to reduce the variability of doing measurements (like using 2D strain or automated segmental strain instead of freehand tissue Doppler ROI placement). So, of course, will the degree of normalisation of measurements for noise (like smoothing in 2D strain).

Repeated measurements in different recordings will have higher variability, as the variation also reflects the variation in how the acquisition is done. In echocardiography, the variation will be less when the recordings are done during the same examination, as f.i. in two different heart cycles (but still detectable, as seen in the study above), than when the test is done in two different examinations. If the examinations are separated by a substantial time interval, the variability will also be a function of within-subject biological variability.

Depending on the kind of test, there may be different statistical tools. Correlation, however, is somewhat dubious, as correlation analysis presupposes independence, which cannot be said to be present when the same thing is measured twice (322).

Reliability can usually be assessed by repeated measurements by the same method, by the same or different persons, and in the same or different samples. Measurements of reliability are limits of agreement, repetition coefficient, and for categorical (like in most imaging) or ordinal data, there are kappa or weighted kappa coefficients.
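As a minimal sketch of limits of agreement and the repetition coefficient (the paired strain values below are invented for illustration, and Python is used only as a calculator here):

```python
import numpy as np

# Hypothetical paired strain measurements (%) of the same segments by two
# observers (or two methods); the values are illustrative only.
obs1 = np.array([-18.2, -16.5, -20.1, -15.8, -17.3, -19.0, -14.9, -16.8])
obs2 = np.array([-17.5, -17.0, -19.2, -16.4, -18.1, -18.3, -15.6, -17.5])

diff = obs1 - obs2            # pairwise differences
bias = diff.mean()            # mean difference = systematic bias
sd_diff = diff.std(ddof=1)    # SD of the differences

# 95% limits of agreement (Bland-Altman): bias +/- 1.96 * SD of differences
loa_lower = bias - 1.96 * sd_diff
loa_upper = bias + 1.96 * sd_diff

# Repetition (repeatability) coefficient: 1.96 * SD of the differences
repetition_coefficient = 1.96 * sd_diff

print(f"Bias: {bias:.2f}%, LoA: [{loa_lower:.2f}%, {loa_upper:.2f}%]")
print(f"Repetition coefficient: {repetition_coefficient:.2f}%")
```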

Looking at the study above (151), we see that there is considerable variability also in the gold standard (MR, even if the repeated measurements were done only in the same recording), meaning, of course, that it is no real gold standard at all.

Biological variability

means both that measurements may vary between subjects (f.i. EF may vary depending on ventricular size and basal heart rate, as we may see in highly trained subjects), and that they may vary over time within the same individual (below).

Thus, there will be need to know the normal variation range of a measurement variable, usually defined as the range between 2.5 and 97.5 percentiles (which equals mean ± 1.96*SD in normally distributed variables), thus comprising 95% of the healthy population.


Distribution of a variable in a normal population. The percentages denote the percentage of individuals at each value, which is equal to the probability of each measurement in the selected population. The range of the variable is commonly defined in statistics as the range encompassing 95% of the values, i.e. the range where the probability of finding the value is 95%. In a normal population, this is the NORMAL RANGE, which doesn't mean that the subjects outside are sick, only that they are outside the 95% range. In this example the variable is normally distributed (statistically normal, but this concept must again be separated from being normal in the sense of being healthy, i.e. medically normal). The 95% range is in this case given by mean ± 1.96 SD. (In a normal distribution, mean ± 3 SD comprises about 99.7%, while the limits of the 100% range in principle are infinite, as probability never reaches zero.)
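A minimal sketch of the two definitions of the normal range on simulated data (the mean and SD are invented, loosely resembling global strain values; for a normally distributed variable, the two ranges coincide):

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated measurements in 500 healthy subjects (illustrative values only)
values = rng.normal(loc=-18.0, scale=2.5, size=500)

# Normal range as the 2.5 and 97.5 percentiles (distribution-free)
p_low, p_high = np.percentile(values, [2.5, 97.5])

# Normal range as mean +/- 1.96 SD (valid when normally distributed)
mean, sd = values.mean(), values.std(ddof=1)
g_low, g_high = mean - 1.96 * sd, mean + 1.96 * sd

print(f"Percentile range: [{p_low:.1f}, {p_high:.1f}]")
print(f"Gaussian range:   [{g_low:.1f}, {g_high:.1f}]")  # similar, by construction
```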


Measurements may also vary by time in a single individual. The EF, for instance, may vary depending on heart rate during the examination, and on hydration status. Some chronic diseases have highly variable disease activity, for instance rheumatoid arthritis, affecting symptoms. The intra individual variation may be as large as the full distribution in the population, but, of course, not bigger. Usually, it is far smaller, as some of the variability in a population may be due to other differences like gender, size or age, in relation to f.i. ventricular volumes.

But in addition, the intra subject variation from measurement to measurement is also due to measurement variability. Most commonly, the measurement variability is smaller than the intra individual biological variability, but not necessarily. EF measurement by ultrasound has limits of agreement of ± 10 percentage points in many studies, which is far more than the intra individual variation by other methods. (Still, this variation will be the limit for detecting a true change in any single individual, and will represent the intra individual variability.) Thus, the intra individual variability is due to both biological variability and measurement variability, and is dominated by whichever is largest.

Both biological and measurement variability decide the diagnostic accuracy of a method.

Regression towards the mean

is a function of selection of study subjects from a population (patient or normal).

Regression towards the mean is due to the individual variation in measurements, both measurement and biological variability. If a sample population is selected from the general population on the basis of a cut off value, the selection will include some subjects whose true or biological mean value lies closer to the population mean than the cut off value. If they are included in the sample, the probability of a second measurement being above the cut off value is > 50%. For subjects whose true or mean value lies below the cut off, the second measurement may move in both directions with a probability of 50%, which will not affect the individual mean. However, this means that the sample population mean will move towards the population mean, i.e. be more normal in the second measurement.



Comparison between population variability, intra individual variability and measurement variability. Usually the intra individual variability is less than the population variability, as shown here. The intra individual variability is due to both biological variation from time to time and measurement variability. In this example, the measurement variability is less than the biological variability, so most of the variation in the individual is due to biological, less to measurement, variability, but this may not always be the case. If the measurement variability is greatest, it will dominate the variation in each individual.
Regression towards the mean. If subjects are selected from a population on the basis of a cut off value, the individual variability means that some subjects will be selected despite their mean or true value being above the cut off, as in this example. But due to this, the second measurement will have a higher probability of being above the cut off value (the pink area). Thus the selection as a whole will have a mean of the second measurement closer to the mean of the general population.
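A minimal simulation of this mechanism (all means, SDs and the cut off are made up; the variable is labelled EF purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
true_value = rng.normal(60, 8, n)   # "true" EF (%) per subject

def measure():
    # One measurement: true value plus measurement/biological variability
    return true_value + rng.normal(0, 5, n)

first = measure()    # measurement used for selection
second = measure()   # independent repeat measurement

selected = first < 40   # select "low EF" subjects on the first measurement
print(f"Population mean:              {true_value.mean():.1f}")
print(f"Selected, first measurement:  {first[selected].mean():.1f}")
print(f"Selected, second measurement: {second[selected].mean():.1f}")
# The second measurement of the selected group drifts back towards the
# population mean, without any real change in the subjects.
```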

This is important in many settings:
If the selection parameter is also the outcome parameter in treatment studies (f.i. BP or EF), there will be a regression towards the mean that will look like improvement due to the treatment. In controlled studies, the regression towards the mean may be expected to be similar in the intervention and control groups, and thus can be estimated. But this is much of the reason why some have maintained that participating in a study is beneficial in itself, and also some of the effect that could be ascribed to the placebo effect.

Longitudinal observational studies in f.i. registries do not have that kind of control, unless a case-control design is attempted in the data.

Diagnostic observation studies, for instance follow up after an infarct, will show the same, if selection is done on the basis of the parameter that is studied in follow up. If selection is by another parameter (for instance troponin measurement or coronary angiography) than that which is studied (for instance EF, peak systolic or post systolic strain rate), there will be no regression towards the mean.

In daily clinic, if a follow up decision is based on a measure like EF, some of the improvement seen may be due to individual variability and not real improvement.

Sensitivity and specificity

Neither validity nor reliability is sufficient to describe the value of a diagnostic method. The main point of any diagnostic method is its ability to discriminate between normalcy and disease.

For quantitative measurements, there is a need to establish cut off values between normal and abnormal measures. Other tests, which are non-quantitative, like much of imaging, establish diagnostic criteria instead; the reliability of applying those criteria can be tested statistically. For both quantitative and non-quantitative tests, the sensitivity and specificity must be tested in clinical studies comparing patients with healthy controls.

Again, the diagnosis needs to be determined by external methods. The sensitivity and specificity are measures of a test's discriminatory ability.





The test is:   | Condition present            | Condition absent                | Sum
Positive       | True positive (Tp)           | False positive (Fp)             | All positive (Tp + Fp)
Negative       | False negative (Fn)          | True negative (Tn)              | All negative (Fn + Tn)
Sum            | All with condition (Tp + Fn) | All without condition (Fp + Tn) | All (Tp + Fp + Tn + Fn)
  • Sensitivity is the percentage of patients with disease who have a positive test, i.e. sensitivity = Tp / (Tp + Fn), the true positive fraction.
  • Specificity is the percentage of subjects without the disease who have a negative test, i.e. specificity = Tn / (Tn + Fp), the true negative fraction.
  • Diagnostic accuracy (not to be confused with measurement accuracy above) is the number of true tests as a proportion of all tests, i.e. accuracy = (Tp + Tn) / (Tp + Fp + Fn + Tn).
The false positive fraction is 1 - specificity.
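As a minimal sketch of these formulas (hypothetical counts; the function is mine, not from any standard library):

```python
def test_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Sensitivity, specificity and diagnostic accuracy from a 2x2 table."""
    return {
        "sensitivity": tp / (tp + fn),              # true positive fraction
        "specificity": tn / (tn + fp),              # true negative fraction
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "false positive fraction": fp / (fp + tn),  # = 1 - specificity
    }

# Illustrative counts only
print(test_metrics(tp=45, fp=5, fn=5, tn=45))
# -> sensitivity 0.90, specificity 0.90, accuracy 0.90
```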

Sensitivity, specificity and accuracy are usually more related to a method's reproducibility than to its validity. Even if measures have a bias compared to a reference method, reference specific normal values may be established. However, for any measurement, the discriminatory ability depends on how the variable is distributed in the patient population vs the healthy population. It is obvious that the sensitivity and specificity of a test are related to the reliability (precision) of the measurement method: a method with lower precision has a wider range of measurement results for any given ground truth value, meaning that there may be larger overlap between values that denote sickness and health, due to the lack of precision.

Since sensitivity only deals with the diseased subjects, and specificity only deals with the subjects without disease, they are at the outset independent of the prevalence of disease. Thus they are generally considered to be properties of the test itself. However, they are influenced by where the cut off level is set.

Normal limits versus cut-off limits


However, the biological variation will also be a decisive factor for the sensitivity and specificity of a test. Very often the normal subjects and the patients have separate distributions of measurement values.



The healthy and sick populations usually have separate distributions, with some degree of overlap. The degree of overlap and the cut off value decides the false positive and false negative percentages, and thus the sensitivity and specificity of the test.


The sensitivity and specificity of a test are thus often dependent on how well the values in the two populations are separated. But that means that the ranges of the two populations may overlap:


Examples of different distributions of measurement values in patient populations (populations 2 - 6), compared to a healthy population (population 1). The overlap between the range of the healthy population and the ranges of the patient populations is shown above the curves. As can be seen here, both the means of the populations (compare populations 2 and 3; the overlap is proportional to the separation of means), as well as the range of each population (compare populations 4 vs 2, less separation of means but smaller range, same overlap with 1; and 5 vs 2, more separation of means but wider range, actually more overlap with 1), will decide how much overlap there is with the healthy population. Populations 2 and 6 show the least overlap, but even population 6, having no overlap between ranges, will have a small percentage of false positives and negatives (coloured areas), due to the definition of ranges as 95% of the population.

The separation between the healthy population and the patient populations is dependent on:
  • The separation of means.
  • The range of the populations.
In scientific studies, this is what is tested by group comparisons between healthy subjects and a defined patient population in the phase I studies. The study will answer the phase I question: Do test results in patients with the target disorder differ from those in normal people?, but not very much else. The separation of means in itself does not tell enough about the test's usefulness; both the measurement precision and the distribution of measurements will decide this. The standard deviation (SD) tells about the distribution of measurements due to the measurement precision, as well as the distribution in the population, while the standard error of the mean (SEM) is the measure of how precisely the mean in each population is determined, and is defined as SD divided by the square root of n, the number of study subjects. Thus, the fact that there is no overlap between the SEM ranges (or 95% confidence intervals) is mainly a function of study size, as is the significance level (they actually measure more or less the same thing). Looking at the standard deviations (or 95% ranges) in the two populations as well may tell something about the degree of overlap in the measurements, and may give some indication of the test's possible usefulness.
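A minimal simulated sketch of the SD vs SEM point (made-up units and group sizes): with a large n, the confidence intervals of the means separate clearly even though the 95% ranges, and hence the measurements, overlap heavily.

```python
import numpy as np

rng = np.random.default_rng(3)

def summarize(sample, label):
    sd = sample.std(ddof=1)
    sem = sd / np.sqrt(len(sample))   # SEM = SD / sqrt(n)
    print(f"{label}: mean {sample.mean():5.2f}, "
          f"95% range +/- {1.96 * sd:4.2f}, "
          f"95% CI of mean +/- {1.96 * sem:4.2f}")

# Two populations with clearly overlapping ranges (illustrative units)
healthy = rng.normal(10.0, 2.0, 400)
patients = rng.normal(8.5, 2.0, 400)
summarize(healthy, "healthy ")
summarize(patients, "patients")
# With n = 400, the CIs of the means do not overlap ("significant"
# difference), while the 95% ranges overlap heavily, so the test would
# still discriminate poorly despite the significant separation of means.
```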



The sensitivity and specificity are a function of how well the populations are separated, but also of where the cut off values are set. For any given test, there is an optimal combination of sensitivity and specificity.



Diagram showing the distribution of the healthy population and the population in series 2. There is an overlap between the population ranges, shown by the blue line in the top diagram. The bottom diagram shows the sensitivity (the cumulative distribution of the sick population to the right of a cut off value) and the specificity (the cumulative distribution of the healthy population to the left of the cut off value).
ROC curve diagram for the examples of series 2 and 3 above. The ROC curve is obtained by plotting the sensitivity against 1-specificity for all possible cut off values. The area under the curve (AUC) is a measure of how well the test discriminates disease from normalcy (the probability that the test will rank a sick patient higher than a healthy one). Series 2 shows a very good test, series 3 a more mediocre test. The diagonal in black shows the ROC curve of flipping a coin; AUC is 50%.

It is evident from the curves above that moving the cut off value to the left increases sensitivity (encompassing more of the sick patients) and decreases specificity (encompassing more of the healthy), while moving the cut off to the right increases specificity at the cost of sensitivity. The diagnostic accuracy remains the same. The optimal combination of sensitivity and specificity is where the lines cross.


The ROC curves are useful to show two things (a numerical sketch follows below):
  1. The area under the curve is a measure of the diagnostic accuracy of the test. The optimal test has an AUC of 100%, a curve corresponding to the blue edge of the diagram, thus encompassing all of the plot area, while an AUC of 50%, corresponding to the black diagonal curve, thus encompassing half the plot area, means that the test has no diagnostic value whatsoever. (That is the diagnostic accuracy you obtain by flipping a coin.) This method can be used for comparing different tests.
  2. Secondly, the optimal combination of sensitivity and specificity is the point where the angle of the curve changes most. (In this case that is on the diagonal, because the distributions are perfectly symmetric, which may not always be the case.)
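A minimal numerical sketch of both points (two simulated, normally distributed populations; all distributions and cut offs are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
healthy = rng.normal(0.0, 1.0, 1000)   # measurement values, healthy subjects
sick = rng.normal(2.0, 1.0, 1000)      # shifted distribution in patients

# Sweep possible cut offs; a value above the cut off counts as positive
cutoffs = np.linspace(-4.0, 6.0, 201)
sens = np.array([(sick > c).mean() for c in cutoffs])      # sensitivity
fpf = np.array([(healthy > c).mean() for c in cutoffs])    # 1 - specificity

# AUC by the trapezoidal rule over the ROC points (sorted by FPF)
order = np.argsort(fpf)
auc = np.sum(np.diff(fpf[order]) * (sens[order][1:] + sens[order][:-1]) / 2)

# Equivalently: probability that a random patient scores higher than a
# random healthy subject
auc_rank = (sick[:, None] > healthy[None, :]).mean()

# "Optimal" cut off: maximum of sensitivity + specificity (Youden index)
best_cutoff = cutoffs[np.argmax(sens + (1.0 - fpf) - 1.0)]

print(f"AUC: {auc:.3f} (trapezoid), {auc_rank:.3f} (rank)")
print(f"Best cut off by Youden index: {best_cutoff:.2f}")
```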

Sensitivity and specificity are basically properties of the test itself, and of the measurement's relation to the distribution in the populations.

Positive and negative predictive value

The interpretation of the results, however, is dependent on the pretest probability of disease, i.e. the prevalence of the disease in the population that is examined.




The test is:   | Disease present            | Disease absent                | Sum
Positive       | True positive (Tp)         | False positive (Fp)           | All positive (Tp + Fp)
Negative       | False negative (Fn)        | True negative (Tn)            | All negative (Fn + Tn)
Sum            | All with disease (Tp + Fn) | All without disease (Fp + Tn) | All (Tp + Fp + Tn + Fn)
  • Positive predictive value is the probability of being ill given a positive test: PPV = true positive / all positive = Tp / (Tp + Fp)
    • The probability of being healthy despite a positive test is 1 - PPV
  • Negative predictive value is the probability of being healthy given a negative test: NPV = true negative / all negative = Tn / (Tn + Fn)
    • The probability of being ill despite a negative test is 1 - NPV
Thus, the positive and negative predictive values of a test are not properties of the test alone, but also of the prevalence of the disease in the population where the test is applied:


Positive and negative predictive values shown as functions of the prevalence of disease, for two tests, one with both sensitivity and specificity of 90%, and one with both at 80%. It is evident that the separation of ill and healthy is best in the intermediate prevalence range, irrespective of the sensitivity and specificity of a test, but the sensitivity and specificity determine how well the test will discriminate, given a specific prevalence. At the ends of the prevalence range, the probability of being healthy despite a positive test, or the probability of being ill despite a negative test, respectively, remains high.



This means that for a given test with sensitivity and specificity of 90%:

Prevalence (= pretest PPV) 90%, pretest NPV 10%:

               Ill     Healthy   Sum
  Positive     81%     1%        82%
  Negative     9%      9%        18%
  Sum          90%     10%       100%
  PPV = Tp/(Tp+Fp) = 99%    NPV = Tn/(Tn+Fn) = 50%

Prevalence 50%, pretest NPV 50%:

               Ill     Healthy   Sum
  Positive     45%     5%        50%
  Negative     5%      45%       50%
  Sum          50%     50%       100%
  PPV = 90%    NPV = 90%

Prevalence 10%, pretest NPV 90%:

               Ill     Healthy   Sum
  Positive     9%      9%        18%
  Negative     1%      81%       82%
  Sum          10%     90%       100%
  PPV = 50%    NPV = 98.7%

Prevalence 1%, pretest NPV 99%:

               Ill     Healthy   Sum
  Positive     0.9%    9.9%      10.8%
  Negative     0.1%    89.1%     89.2%
  Sum          1%      99%       100%
  PPV = 8%    NPV = 99.8%

We see that it is in the intermediate probability range that the test is most useful.
  • With a prevalence of 90%, there is already a 90% probability that the patient is ill, and the increase to 99% doesn't add much information. There is also still a 50% probability that the patient is ill despite a negative test, so the test will not rule out disease.
  • With a prevalence of 50%, the test is really useful, increasing both the negative and the positive predictive value from 50% to 90%; thus it is effective both in ruling out and in ruling in disease.
  • With a prevalence of 10%, the test again has limited usefulness,
    • as the NPV only increases from 90 to 98.7%, the test is not very useful in ruling out disease that was not very probable in the first place,
    • as the PPV only increases to 50%, it is not very useful in ruling in disease. However, in this case, used as a first or screening test, it can form the basis for a second test, where the selected population now is in the intermediate probability range. Thus, the test can be useful if it is cheap and safe, to increase the pretest probability before a second, more expensive test, or one with a higher complication risk (e.g. exercise ECG before coronary angiography).
  • With a prevalence of 1%, the test is practically useless: the near certainty of being healthy (NPV, 99.8%) was there before the test (99%), the PPV of 8% does not rule in disease, and in addition the false positives outnumber the true positives by a factor of 11, making the interpretation of test results even more uncertain, in addition to increasing the number of follow up tests, making a diagnostic strategy in such a population very costly.
Thus, the negative and positive predictive values are measures of how the test performs in a population with a specific prevalence. All tests perform best in the intermediate prevalence range, but NPV and PPV are also dependent on the sensitivity and specificity.
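A minimal sketch reproducing the numbers in the tables above (sensitivity and specificity fixed at 90%; the helper function is mine, not a library routine; small differences from the tables are rounding):

```python
def predictive_values(sens, spec, prevalence):
    """PPV and NPV of a test in a population with the given prevalence."""
    tp = sens * prevalence              # true positives
    fn = (1 - sens) * prevalence        # false negatives
    tn = spec * (1 - prevalence)        # true negatives
    fp = (1 - spec) * (1 - prevalence)  # false positives
    return tp / (tp + fp), tn / (tn + fn)

for prev in (0.90, 0.50, 0.10, 0.01):
    ppv, npv = predictive_values(0.90, 0.90, prev)
    print(f"prevalence {prev:4.0%}: PPV {ppv:6.1%}, NPV {npv:6.1%}")
# prevalence  90%: PPV  98.8%, NPV  50.0%
# prevalence  50%: PPV  90.0%, NPV  90.0%
# prevalence  10%: PPV  50.0%, NPV  98.8%
# prevalence   1%: PPV   8.3%, NPV  99.9%
```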



Phases in diagnostic research

There is no general consensus about how to assess diagnostic tests (308), although a system of phases in research, comparable to those in clinical trials, has been suggested. This would greatly help in assessing how far a test has come in development. However, there is no general consensus about what the phases should be (308, 311).

It has been suggested that phase I should be the establishment of the normal range (308). This may be feasible in biochemistry, where tests are fairly cheap. But in medical imaging, the cost of running a large group of healthy subjects through the imaging process to establish the normal range, before it is even clear whether the test has satisfactory discriminatory ability to be of value, seems counterproductive for research; this approach is practically useless, and almost never done in imaging research.

However, medical history is full of instances where imaging has first been applied exclusively to patients with symptoms, where findings have been taken as explanations of the symptoms, and where later studies in normals have shown that the findings are present normally, and are neither pathological nor associated with symptoms, as exemplified by the history of MR findings in whiplash (313, 314). Thus, at some point in time, normal studies should be done.

Sackett and Haynes (311) have set out a series of questions to define each phase (for diagnostics in general, not only imaging):

This is a fairly reasonable approach, that applies to most tests and imaging methods.

Pre clinical studies

Are thus the initial:

Phase I studies

(Do test results in patients with the target disorder differ from those in normal people?)

Typically studies in patients where the diagnosis is clearly established, as compared to subjects who definitely don't have the diagnosis.

This is mainly a study of feasibility and clinical validation (against clinical diagnosis, which may be a reference method, or the total diagnosis): establishing the validity of a new method compared to a reference method (e.g. infarcts vs. LGE MR, or stress echo vs. coronary angiography), and how many patients or segments it can be used on. The study will answer the phase I question, but not very much else. As discussed above, the separation of means in itself does not tell enough about the test's usefulness, as both the measurement precision and the distribution of measurements will decide this; the absence of overlap between the SEM ranges (or 95% confidence intervals) is mainly a function of study size, while the standard deviations (or 95% ranges) in the two populations may give some indication of the degree of overlap in the measurements, and hence of the test's possible usefulness. Doing test-retest within the study will give a measure of the measurement precision (reliability), and will indicate how much of the variation is biological and how much is due to measurement.


In this phase, however, the patients are highly selected, and the variability of measurements may be less (limited in other factors that may affect measures, such as age or filling pressure) due to the limited number of participants; in imaging, the apparent precision may also be higher due to patient selection for image quality. Thus, the separation between groups may be greater than in a less selected population, not least compared with a group that is merely suspected of having the disease. Thus, results from the differences between those groups cannot be applied to patients who are merely suspected of having the diagnosis.

Phase I studies can establish initial cut off values (for quantitative measurements), or diagnostic criteria for visual image analysis, but only as hypotheses.

Phase II studies

(Are patients with certain test results more likely to have the target disorder than patients with other test results?)

Phase II studies should preferably be done prospectively. A quasi prospective design is possible, of course, if test results or images are stored in a database and then approached in a blinded manner. However, this will often mean a more selected population, selected on the basis of diagnosis as in phase I. Very often studies are designed to attempt to do both phase I and phase II; however, that will overestimate the true sensitivity and specificity in comparison with a prospective study.

Phase II studies seek to establish an estimate of the sensitivity and specificity of the test, by comparing it both with a gold standard (which may be diagnosis by full clinical evaluation, or a reference method), as well as the diagnostic value of the method compared to other methods (e.g. global strain compared to WMSI and EF in heart failure). The ROC analysis will give the diagnostic value as AUC, and the optimal cut off values for highest sensitivity and specificity. Many studies then apply the values for sensitivity and specificity found in the ROC analysis as the study end points. This, however, violates statistical independence; the ROC analysis should be applied to one part of the study sample, while the resulting sensitivity and specificity should be applied to a second part of the study sample. As cut offs are established, the true and false positives and negatives are calculated (necessary for sensitivity and specificity); of course, the negative and positive predictive values can also be calculated. The resulting PPV and NPV, however, have no bearing on the clinical application of the test, as this is not a target patient population.

Also, phase I and II studies are generally small and limited, as well as being single centre. This will have bearing on:
Phase I and II studies are still mainly hypothesis generating, about the clinical utility of a method, and about the measurement cut off values. Very often in phase II, the control group is super normal. When the test is applied to a less selected population, the range of values in the control population may increase. This affects the sensitivity and specificity of the test as well (an increase in both the false positive and the false negative proportions will reduce both), not only the NPV and PPV.

Ideally, in phase II, a first comparison with existing methods should be done. However, it's a sad fact that there is a trend to do little comparison, just research into what is new and "sexy".


Basically, at this stage, it should be determined whether a new method is worth pursuing, although new methods may give different information even when performing only similarly to existing methods concerning the primary diagnostic target. This is for instance the case with strain rate imaging: although the wall motion score in some studies performs as well as strain rate / strain measures (205, 219), strain rate gives new information, showing more about the pathophysiology of both regional systole and diastole, and thus adds more information in total, for an integrated echocardiographic examination.

On the other hand, initial phase II studies may show superior performance of a new method compared to an older method. However, looking closely at the difference, it is often not that the new method performs better than the old method did in previous studies, but rather that the older method performs poorer than it did in earlier studies.

The causes of this are various:
Thus the larger unselected prospective trials in phase III may be more definitive in comparing methods. Even in phase III, however, there may be a bias towards favoring newer methods over older, as in phase II. Studies should thus always be compared to older studies, to see if there is a real increase in accuracy, or simply a drop in the accuracy of the older methods.

It is a well known phenomenon in medical literature that as new methods are investigated, they tend to show better results than older methods, not so much because the newer methods have better accuracy, as because the older methods perform poorer in the new studies.



Hmm; do I smell something fishy? Blowing Humpback whale at the point of diving.



A classical example is the studies of B-mode stress echocardiography, where a multi centre study showed a sensitivity and specificity of stress echocardiography of 76 and 87%, respectively, and an inter institutional agreement of kappa 0.37 in the era of fundamental imaging (103). In a newer study with harmonic imaging, the sensitivity increased to 92% with harmonic imaging, but the sensitivity with fundamental imaging dropped to 64% (104). Also, inter institutional agreement showed a kappa of 0.55 with harmonic vs 0.49 with fundamental imaging (105).

In this case, however, the findings may be reasonable, seeing that harmonic imaging not only improves the quality of the images, but actually lifts some of the patients with the poorest windows up to a quality level where images may actually be interpreted. This may mean that the second round of studies had more patients with poor image quality, and thus a real reduction in the performance of fundamental imaging.


Phase III studies

IIIa: The phase III question according to Sackett: (Does the test result distinguish patients with and without the target disorder among patients in whom it is clinically reasonable to suspect that the disease is present?) This question has direct bearing on the positive and negative predictive values, as these are a function of the pre test probability of the diagnosis. The patient population should be the population where the method is likely to be employed, and the patients being tested should be unselected and included prospectively. This will be a real life situation.
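The dependence of the predictive values on the pre test probability follows directly from Bayes' rule, as in this little sketch (the sensitivity and specificity are arbitrary example values):

def predictive_values(sens, spec, prevalence):
    """PPV and NPV for a given pre test probability (Bayes' rule)."""
    p_pos = sens * prevalence + (1 - spec) * (1 - prevalence)   # P(positive test)
    ppv = sens * prevalence / p_pos
    npv = spec * (1 - prevalence) / (1 - p_pos)
    return ppv, npv

# The same test (sensitivity 85%, specificity 90%) at different pre test probabilities
for prevalence in (0.05, 0.30, 0.60):
    ppv, npv = predictive_values(0.85, 0.90, prevalence)
    print(f"pre test probability {prevalence:.0%}: PPV {ppv:.2f}, NPV {npv:.2f}")

At 5% pre test probability the PPV is only about 0.31, at 60% it is about 0.93, with the same test: hence the requirement that the phase III population resembles the target population.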

The statistical power of a phase III study can be estimated from phase I and II studies, and the study size should be pre-planned accordingly.

Thus, the feature distinguishing phase III from phase II is the application to a population where the diagnosis is not established, resembling the population where the method is to be applied for diagnosis.


In general, study end points should be the relevant clinical diagnosis (infarct vs. non infarct, occluded vs. non occluded IRA, etc.), evaluated by AUC, sensitivity and specificity. Also, even a diagnostic phase III study may be limited to a subset of the actual population, missing out on parts of the population, so that the results are less valid in a larger group.

This was evident for the E/e' ratio used for assessment of left atrial pressure (71, 72, 177), where the last study might be seen as a phase III study. However, as this was extended by normal studies (165, 166), the ratio was shown to be age dependent, thus limiting the cut off value. And finally, a larger study with less selected patients showed less correlation with atrial pressure than earlier studies (272), probably due to the inclusion of more patients with bundle branch block or RV pacing.

To go from evidence level B to A thus means that the results should be confirmed in more than one study.




However, this is only part of phase III.

Technology driven implementation of new methods, especially in imaging, less in clinical chemistry, will lead to increased use over time, both due to newer methods being more sensitive or specific, being less invasive (e.g. CT angiography vs. invasive angiography), and in the later stages due to reduced cost and better availability through increased distribution of technology (like MR). This means that the population distribution may shift over time, and this will affect the positive and negative predictive value of the test, and even sensitivity and specificity, if the original cut off values are maintained.



IIIb:

In phase III the prognostic value of the findings should be established. This again needs a prospective design, following a large patient cohort and comparing test results with clinical events. In this phase, the diagnostic value of the test (IIIa) can also be considered established.

It should be remarked, however, that if disease definitions change, some evidence may be invalidated, or at least weakened.

However, even if the prognostic predictive value of a test is established, this need not establish that having the test itself improves prognosis (phase IV).



IIIc: Finally, phase III studies should also include the larger normal range studies. As mentioned above, applying imaging before one knows the spectrum of normal variation is fairly dangerous, and may lead to over diagnosis. Attempts to explain symptoms by imaging findings, without knowing how many subjects without symptoms have the same findings, is methodological trash. Even so, this is done, for instance in both joint MR and arthroscopy.

This is also relevant for seeing how much the normal population differs from the patient population with regard to cut off values, as illustrated above. It is also the ultimate test of the results from diagnostic phase III studies: how many false positives will you see in a defined healthy population? This means that in examining undifferentiated patients, the normal limits should be considered. Many large echo studies, however, do not have the design necessary to ascertain this kind of normal limits. Some studies include unselected populations, but this means that the studies will not give new information about normal variation, as normalcy then has to be defined by previously established cut off values.

However, normality must then be defined by criteria that are independent of the test being studied.

Phase IV studies

(Do patients who undergo this diagnostic test fare better (in their ultimate health outcomes) than similar patients who are not tested?) This means outcome studies. In diagnostics, this is phase IV instead of phase III, diagnostic outcome being the phase III question, clinical outcome then being phase IV.

Pure outcome studies of diagnostics are fairly rare. They are beyond the scope of RCTs, as randomisation to using a diagnostic test or not, after the diagnostic value has been established at phase III level, will be problematic. Also, the long follow up time makes phase IV more suited to registry studies.

However, there is also some fields where diagnostics and therapeutic decisions are interwoven. If the measurement becomes the basis for a treatment that is then evaluated in a phase III intervention trial showing benefit in terms of clinical outcome, the whole investigation-intervention sequence is the subject of the trial.
Even if guidelines do not specifically mention imaging as a recommendation, a recommendation of an intervention based on imaging or measurement of course presupposes the diagnosis, and both strength of recommendation and evidence level should be the same, if the guideline is consistent. Thus, this can be said to constitute phase IV evidence for the diagnostic method, as has been shown for instance for beta blockers and ACE inhibitors in dilated heart failure (diagnosed by EF measurement), and CRT in dilated heart failure with LBBB (diagnosed by EF, but so far not by deformation imaging as described above).

However, phase IV should, as for interventions, also include some kind of post marketing surveillance. There is always a trend toward increased use of diagnostic tests in lower probability populations, where they do not improve outcome, and may even contribute to adverse events, due to the high number of false positives that may result in further investigations (or at worst, interventions to be on the safe side) with more invasive methods that have side effects. Thus, guidelines should also be established based on evidence, to try to define the limits for the use of such methods in different clinical settings (317, 318).

Technology driven diagnostics

Basically, any diagnostic method implies applying some kind of technology. This means that at some point in time, the technology was a specific addition to the methods of history and clinical examination. The first interesting thing is that (apart from a very few clinical indices in some narrowly selected fields), history and clinical examination themselves have no scientific evidence base.

Still, this base of general knowledge and clinical experience is the basis for all diagnostics. This is because the clinical examination is the basis for selecting patients for extended diagnostic examination. Basically, it is used to assess the probability of the presence or absence of disease, to achieve a pretest probability that will render the extended diagnostic examination useful in terms of negative or positive predictive value. Thus, the clinical conclusion is basically probabilistic.

Some of the simplest technologies have replaced clinical signs as diagnostic criteria. Hemoglobin measurement has replaced "paleness", oxygen saturation measurement has replaced "cyanosis", the Doppler gradient has replaced auscultatory assessment of aortic stenosis severity, etc. In these cases, the clinical signs have become looser probabilities of disease, while the extended diagnostic examination has become the verification of disease. In some cases, the measurements have become part of an extended clinical examination (like hemoglobin at hospital admittance, or saturation in the intensive ward).

As diagnostic methods are more precise than clinical examination, diseases are more often defined from these (329).

As development of imaging is technology driven, there will always be an interest in the newer and more "sexy" methods, even if the improvement in performance is marginal.





Lack of a normal reference will then be a possible mechanism for over diagnosis, and in the next phase, over treatment.

The remedy is of course not to curb technological development, but to acquire normal studies, and to be cautious in using technology without proper phase III studies.







But if there is increased use of the test, even the prognostic value of having the test may be falsely considered positive; this will be due to the changing spectrum of disease as the test is used more widely.


Echocardiography

I don't have an ambition of reviewing the whole of the database for echocardiography, but will illustrate the field with some examples.

In echocardiography, the best evidence concerns the assessment of valvular heart disease, where the treatment decisions are all in some way or other related to echocardiographic measurements. In these cases, the outcome is linked to the treatment, so the recommendation is so obvious it is not even stated (315, 316). Echocardiography here of course has phase IV evidence, and should be considered class IA recommended.

A rather curious fact is that while measurement of EF (though not necessarily by echocardiography) is implemented in guidelines for heart failure, echocardiography itself has recommendation class I, level of evidence C (as has MR), while treatment recommendations based on EF measurement have IA recommendations in the same guideline (309, 310), despite the fact that without EF measurement, there could be no treatment recommendation. Thus, the recommendations seem a little inconsistent.

Methodologically, this seems impossible: if the premise for a treatment is level C, the intervention trials cannot give a higher evidence level. Or, as discussed above, if the diagnostic is a necessary premise for the intervention, once the intervention trials establish evidence level A, this is actually about the whole diagnosis-treatment sequence, and evidence level A is established even for the diagnostic procedure. And the improved prognosis in CHF associated with treatment intervention based on EF measurement constitutes phase IV evidence, with modifications as discussed below.


Most studies in echocardiography are about single measures, or combined indices of a few measurements (like MPI or E/e'). These are the components of echocardiography most suited to the design of quantitative studies, but they do not take into account the totality of information in an echo examination.

However, quantitative indices, both single measures and combined indices, may perform differently depending on the rest of the pathophysiology. This means that initial reports that an index measures a certain aspect of cardiac function may only be true in a selected population representing a subset of patients. For instance:

  1. EF, being one of the most widely used indices of ventricular function, describes decreasing ventricular systolic function in various degrees of ventricular dilation, and thus functions fairly well when only normals and patients with dilated heart failure are included in studies. (Even so, there are systolic measures that are more sensitive.) In systolic dysfunction, ejection fraction still remains a cut off criterion for intervention in many studies. However, as shown later, EF is dependent on geometry, especially wall thickness (228). Thus, (normal) EF should be interpreted in terms of geometry, and not as a single criterion of normal systolic function. Reduced EF, on the other hand, will still be a criterion of reduced systolic function.
    1. As up to half of HF patients are now classified as heart failure with preserved ejection fraction (324), EF loses its diagnostic value for heart failure in an unselected HF population, as it necessarily does not identify the patients with heart failure with preserved ejection fraction. (This is obvious, and does not need to be proven in studies.)
    2. It has also lost its prognostic value in the unselected HF population, showing no prognostic predictive value (36, 192, 227), nor any relation to functional capacity.
    3. As EF is still used as a cut off for intervention, EF has phase IV evidence, but with newer knowledge this is limited to dilated heart failure, and normal EF has little diagnostic, prognostic or discriminatory value. This also means that evidence for treatment outcome in HFpEF is totally lacking.
  2. The E/A ratio as well as IVRT are measures of diastolic function, changing with reduced diastolic function (E/A reduced, IVRT prolonged), as shown in large studies (phase III).
    1. Normal function, however, is age dependent, with E/A being reduced with age. This still reflects the reduced relaxation rate that occurs with age, but discriminatory values have to be age adjusted.
    2. The E/A ratio of reduced diastolic function, however, is (pseudo-) normalised with increasing atrial pressure, and thus only works in selected study populations without increased atrial pressure. The same is true of the myocardial performance index (MPI).
  3. To adjust for this, the mitral annulus early diastolic velocity reflects the relaxation rate, with less load dependency. Thus, the E/e' ratio would be related to atrial pressure, meaning that if E increases and e' does not, the increase in E basically has to be pressure driven, and the increase in E/e' reflects the increase in atrial pressure. However, this only works one way. An increase in E/e' due to a decrease in e' does not reflect a change in atrial pressure. This is the case when subjects go from supine to sitting (29, 160), where atrial pressure drops while E/e' increases; in LBBB, where e' decreases in the septum (and synchronicity between E and e' is lost) due to PSS; and in low stroke volume, where e' decreases as well. On the other hand, constriction may show a high e', and thus normal E/e', at high atrial pressures (273). Thus, again, interpretation of the index must be modified by other echo and clinical findings.
  4. Aortic stenosis is evaluated by stenosis mean gradient and valve area; however, both are stroke volume dependent, meaning that in conditions with low stroke volume, the diagnostic value is low (although in these cases, the discrepancy between the indices gives the clue).
  5. In strain rate imaging, this is evident as increased shortening in non infarcted segments of an infarcted ventricle, which is due to load reduction through segment interaction, not hypercontractility. Thus, strain rate measures contractility when it is regionally reduced, but not when it is increased. Post systolic shortening may be a marker of ischemia, but not where there is LBBB, and may be absent in global ischemia, due to the lack of normal segments.


Thus, evaluation by echocardiography, including strain rate imaging, is dependent on an integrated evaluation of all available information to assess the total situation and the meaning of single measures, while most studies only extract parts of the information in the form of single measurements or indices. A full echocardiographic assessment is thus not feasible for testing in studies. Registry studies, of course, may be used for outcome studies of echocardiography in large populations.

This is reflected in the guidelines for echocardiographic examination, which contain neither class of recommendation nor level of evidence assessment (146).

Most echo studies are also fairly small, and single centre. This is partly an economic issue (they are not driven by industry, but by academic institutions), as well as a practical issue, as the number of measurements needs to be limited. (The limited number of cases increases the amount of data that can be handled per patient, but limits the transferability of the results.) This means that for newer indices, there are mostly phase I and II studies, and evidence level C.

An example of a phase I study in general echo is the following:

In the study of Thorstensen et al. (154), the reproducibility (inter observer, repeated acquisitions as well as analysis) of the different global measures was as follows: 


Measurement | Mean value (repeated measurements) | Coefficient of repetition (2 SD of difference between repeated measures) | Mean error (% of mean)
EF (biplane Simpson; % points) | 59 | 7 | 10
MAE (by M-mode; mm) | 17 | 1.6 | 4
S' (pwTDI; cm/s) | 9.1 | 1.7 | 8
Global strain (2DS; % points) | -21 | 2 | 6
Global strain (averaged from segmental strain by combined ST and TDI; % points) | -19 | 2 | 4
Global strain rate (2DS; s-1) | -1.1 | 0.2 | 10
Global strain rate (averaged from segmental strain by combined ST and TDI; s-1) | -1.2 | 0.2 | 8

The coefficient of repetition is the value obtained by Bland-Altman analysis, and also represents the lowest significant difference between two measurements (e.g. repeated measurements in a single patient). No method showed correlation between error and mean, and the mean error gives the error in percent of the mean, so as to be comparable between methods giving different units. The overall ANOVA significance of the differences in mean error was p=0.001, indicating that at least the differences between 10% mean error (EF and 2DS strain rate) and 4% (MAE and global strain by the segmental method) are significant.
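How such reproducibility numbers arise can be sketched as follows (simulated data; the exact definition of the mean error used in the study may differ slightly from the one assumed here):

import numpy as np

def repeatability(m1, m2):
    """Bland-Altman type repeatability of paired repeated measurements."""
    diff = m1 - m2
    cor = 2 * np.std(diff, ddof=1)          # coefficient of repetition: 2 SD of differences
    mean_error = 100 * np.mean(np.abs(diff)) / abs(np.mean((m1 + m2) / 2))
    return cor, mean_error

# Simulated repeated global strain measurements (%), measurement noise SD 0.7
rng = np.random.default_rng(0)
true_strain = rng.normal(-19, 2, 50)
m1 = true_strain + rng.normal(0, 0.7, 50)
m2 = true_strain + rng.normal(0, 0.7, 50)
cor, err = repeatability(m1, m2)
print(f"CoR {cor:.1f} % points, mean error {err:.0f} % of mean")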


The reproducibility of single point measurements is thus far lower, as shown previously (40), where averaging improved the reproducibility of both MAE and S' on the order of 25%, compared to single point measurements. In the study by Thorstensen, the inter observer reproducibility of pwTDI S' was as follows:

Measurement | Mean value (repeated measurements) | Coefficient of repetition (2 SD of difference between repeated measures) | Mean error (% of mean)
S' (pwTDI; mean of 4 points; cm/s) | 9.1 | 1.7 | 8
S' (pwTDI; mean of septal and lateral; cm/s) | 9.2 | 2.3 | 11
S' (pwTDI; septal; cm/s) | 8.4 | 2.9 | 13
S' (pwTDI; lateral; cm/s) | 10.1 | 2.1 | 9
S' (pwTDI; inferior; cm/s) | 8.7 | 2.8 | 15
S' (pwTDI; anterior; cm/s) | 9.3 | 2.7 | 12
Variability of S' when taken as the mean of four, two or single points. It is evident that taking the mean of septal and lateral gives the same mean value as four points, but the variability is higher, mean error 11 vs. 8% (although this difference was not significant, p = 0.11). However, the corresponding reduction for e' was from 15 to 8% (p<0.001). The overall reduction in mean error between a single point and the four point average was also significant.


Basically, studies being small also means that repeated studies showing poorer performance, due to less patient selection or simply different populations, run a higher risk of not being published, and thus the evidence base for meta analyses is limited.

Still, phase III evidence in echocardiography has appeared: diagnostic criteria, normal distributions, and prognostic value for a number of systolic and diastolic function parameters.

Even if EF is no longer useful for diagnosing heart failure (only for classifying, and assessing degree within a subset of HF patients), echocardiography itself will be useful both in establishing the diagnosis, in assessing systolic and diastolic function, and in estimating whether filling pressures are elevated. But the number of publications on other indices is small, and these have not been implemented in clinical intervention studies.

Studies of qualitative assessment are generally lacking, among other things because they need to be large (as well as not being quite as "sexy" as measurement of peak values, it seems).





Extracting useful material. Herring gull fishing for entrails in Hemne, Norway.

Clinical echocardiography is a complex method, consisting of scores of data, assessed either visually or by measurement. No measure has a precision of 100%, meaning that some measures will not be consistent with the others. Thus, no single measurement is perfect and gives the diagnosis; an echocardiographic examination always consists of using all the available, more or less imperfect data, weighing findings against each other (including deciding which measurements to disregard) and integrating the information. Clinical judgment is also needed to evaluate the meaning of abnormal findings, in terms of other conditions (like loading, hypertrophy or other). This approach is for practical reasons nearly impossible to evaluate in a quantitative study, and approaches the more qualitative research of using background knowledge (from physiology/pathophysiology and case reports/case series).

However, this means that a thorough knowledge of the technology and limitations of echo, as well as experience with the method, may be of use in interpreting findings.

With the limitations inherent in basic ultrasound and in the specific methods, clinical ultrasound will partly be a craft, not pure science, and a knowledge of the methods themselves and the method specific limitations is essential, in the manner of an extended clinical examination. In many ways clinical echocardiography can be considered exactly that: an extended clinical examination, more than a specific imaging modality.

It may be important to remember that clinical examination and medical history, being the two main cornerstones of medicine, have no scientific evidence base whatsoever, despite being the main criteria for selecting patients for any diagnostic procedure.

The conclusion will usually be fairly certain in the hands of an experienced clinician, even if single measurements are not. This is a fundamental property of all echocardiography.
Deformation imaging, however, is no worse than other echo modalities once a critical view of the limitations is applied, but it should be used together with the whole examination.




The evidence base for strain rate imaging

To start with: for documentation of evidence, the main point is whether deformation measures give added diagnostic value compared to basic echocardiography. Most evidence still remains at the phase II stage.

Still, some phase III evidence has appeared, in the form of normal distribution, as well as prognostic studies.

Normal ranges for strain rate and strain: The HUNT echo study

In a recent population study, the Nord-Trøndelag Health Study (HUNT), 1266 subjects without known heart disease, hypertension or diabetes were randomly selected from the total study population of 49 827, and subjects with clinically significant findings on echocardiography (a total of only 30) were excluded (153). This is the largest strain rate echocardiographic study of a defined normal population. End systolic strain and peak systolic strain rate were measured by the combined tissue Doppler / speckle tracking segmental strain application of the Norwegian University of Science and Technology, but the results were compared with other methods in a subset of subjects, showing small differences.

The population had the following characteristics:

The study consisted of 673 women with a mean BP of 127/71, mean age of 47.3 years and BMI of 25.8, and 623 men with a mean BP of 133/77, mean age of 50.6 years and BMI of 26.5. Age was normally distributed in both sexes, with an SD of 13.6 and 13.7 years, respectively. 20% of both sexes were current smokers.

Ordinary echo findings were (165):
 | Female | Male
IVSd (mm) | 8.1 | 9.5
LVIDd (mm) | 49 | 53
LVPWd (mm) | 8.2 | 9.6
FS (%) | 36 | 36
Mitral E (cm/s) | 75 | 66
Dec-T (ms) | 218 | 238
IVRT (ms) | 93 | 103
Values are means.

These findings are in accordance with other studies, like the findings of Schirmer et al (156, 157), so the study population may be assumed to be representative.

Normal values for systolic velocities of the right and left ventricle from the HUNT study (165).

 | S' (pwTDI), left ventricle (mean of 4 walls) | S' (cTDI), left ventricle (mean of 4 walls) | S' (pwTDI), right ventricle (free wall)
Females: < 40 years | 8.9 (1.1) | 7.2 (1.0) | 13.0 (1.8)
Females: 40 - 60 years | 8.1 (1.2) | 6.5 (1.0) | 12.4 (1.9)
Females: > 60 years | 7.2 (1.2) | 5.7 (1.1) | 11.8 (2.0)
Females: all | 8.2 (1.3) | 6.6 (1.1) | 12.5 (1.9)
Males: < 40 years | 9.4 (1.4) | 7.6 (1.2) | 13.2 (2.0)
Males: 40 - 60 years | 8.6 (1.3) | 6.9 (1.3) | 12.8 (2.2)
Males: > 60 years | 8.0 (1.3) | 6.4 (1.2) | 12.5 (2.3)
Males: all | 8.6 (1.4) | 6.9 (1.3) | 12.8 (2.2)
Annular velocities by sex and age. Values are mean (SD). pwTDI: pulsed tissue Doppler, recorded at the top of the spectrum with minimum gain; cTDI: colour TDI. The normal range is customarily defined as mean ± 2 SD.




Normal values for left ventricular strain and strain rate from the HUNT study (153)

 | End systolic strain (%), female | Peak systolic strain rate (s-1), female | End systolic strain (%), male | Peak systolic strain rate (s-1), male
< 40 years | -17.9 (2.1) | -1.09 (0.12) | -16.8 (2.0) | -1.06 (0.13)
40 - 60 years | -17.6 (2.1) | -1.06 (0.13) | -18.8 (2.2) | -1.01 (0.12)
> 60 years | -15.9 (2.4) | -0.97 (0.14) | -15.5 (2.4) | -0.97 (0.14)
Over all | -17.4 (2.3) | -1.05 (0.13) | -15.9 (2.3) | -1.01 (0.13)
Values are given as mean (SD). The customary definition of normal values as mean ± 2 SD, comprising about 95% of the normal population, results in wider normal limits than previously published cut off values from small patient studies. The values were normally distributed, with no clinically significant differences between levels or walls. Values decline with age, as do the velocities.
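As a trivial sketch of how such normal limits are derived from the tabulated values (the function is my own illustration):

def normal_range(mean, sd):
    """Customary normal range: mean ± 2 SD, covering about 95% of a normal population."""
    return mean - 2 * sd, mean + 2 * sd

# End systolic strain, females < 40 years (HUNT): mean -17.9%, SD 2.1
low, high = normal_range(-17.9, 2.1)
print(f"normal range: {low:.1f}% to {high:.1f}%")    # -22.1% to -13.7%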


The study showed strain and strain rate to be normally distributed:

Normal distribution of strain and strain rate.

This shows an important point in evaluating studies, namely the influence of technology. The HUNT study showed nearly perfect normal distribution.

Looking at different walls, it is evident that despite the differences between systolic velocities of the annulus (which might be taken as the sum of all strain rates along the wall), there is very little difference between walls:



 | Anteroseptal | Anterior | (Antero-)lateral | Inferolateral | Inferior | (Infero-)septal
pwTDI S' (cm/s) | - | 8.3 (1.9) | 8.8 (1.8) | - | 8.6 (1.4) | 8.0 (1.2)
cTDI S' (cm/s) | - | 6.5 (1.4) | 7.0 (1.8) | - | 6.9 (1.4) | 6.3 (1.2)
SR (s-1) | -0.99 (0.27) | -1.02 (0.28) | -1.05 (0.28) | -1.07 (0.27) | -1.03 (0.26) | -1.01 (0.25)
Strain (%) | -16.0 (4.1) | -16.8 (4.3) | -16.6 (4.1) | -16.5 (4.1) | -17.0 (4.0) | -16.8 (4.0)
Results from the HUNT study (153, 165), with normal values based on 1266 healthy individuals. Values are mean (SD in parentheses). Velocities are taken from the four points of the mitral annulus in the four chamber and two chamber views, while deformation parameters are measured in 16 segments and averaged per wall. The differences between walls are seen to be smaller for deformation parameters than for motion parameters, although still significant due to the large numbers.

This is due to the walls with the higher velocities being longer, i.e. velocity difference per length remains the same, as discussed here.

Another, smaller study (152) did show higher mean values of both strain and strain rate in the healthy subset (236 of 480). The standard deviations, however, were wider.


The HUNT study used segmental strain by combined speckle tracking and tissue Doppler. This is sensitive to the presence of clutter, but only in the segment borders, where the tracking kernels are placed. If there is clutter, the kernel will not track, and this will affect both segments bordering on that kernel, as explained here. Segments bordering on non-tracking kernels were discarded, and since the aim of the study was to establish normal values, a high discard percentage would ensure representative values. The discard rate of 40% of segments should thus be considered a strength, not a weakness, of the results. Also, the segmental method is far less sensitive to random noise, which is a problem with tissue Doppler derived strain rate.

The smaller study used a velocity gradient by tissue Doppler. This is far more sensitive to random noise, which may tend to increase peak values of strain rate, depending on the degree of temporal smoothing, although the use of cine compounding of three cycles will reduce this. However, the velocity gradient will also be sensitive to clutter. In that study, the sample volume was only placed in the basal part of two walls. The presence of clutter in the upper end of the sample volume will tend to increase strain rate values. This might introduce a systematic bias, dependent on the discard rate, which I cannot find reported. Also, strain was reported as Eulerian ("natural") strain, which has higher absolute values (for shortening) than the more customary Lagrangian strain, although this does not explain the full difference.

Thus, the knowledge of methods is essential for interpreting the evidence.

Another study, using speckle tracking 2D strain (207), found normal values closer to those found in HUNT, but also found a base to apex gradient in strain. However, this may be another artefact: the curvature dependency of 2D strain.

In the HUNT study, we did a comparison of the methods:



Level | M1 peak SR | M1 ES strain | M2 peak SR | M2 ES strain | M3 peak SR | M3 ES strain | M4 peak SR | M4 ES strain
Apical | -1.12 (0.27) | -18.0 (3.6) | -1.46 (0.85) | -14.6 (9.0) | -1.31 (0.73) | -17.2 (9.1) | -1.12 (0.37) | -18.7 (6.6)
Midwall | -1.08 (0.22) | -17.2 (3.2) | -1.29 (0.56) | -18.2 (7.4) | -1.40 (0.58) | -16.9 (7.1) | -0.99 (0.23) | -18.3 (4.7)
Basal | -1.03 (0.24) | -17.2 (3.5) | -1.71 (0.94) | -19.6 (9.3) | -1.59 (0.74) | -17.1 (8.6) | -1.12 (0.36) | -18.0 (6.2)
Mean | -1.08 (0.25) | -17.4 (3.4) | -1.45 (0.79) | -17.7 (8.5) | -1.43 (0.67) | -16.7 (8.1) | -1.07 (0.33) | -18.4 (5.9)
Comparison between methods. Peak SR: peak systolic strain rate (s-1); ES strain: end systolic strain (%). Standard deviations in parentheses. Method 1 (M1): segment length by TDI and ST; Method 2 (M2): velocity gradient (stationary ROI); Method 3 (M3): dynamic velocity gradient (tracked ROI); Method 4 (M4): 2D strain (AFI).


The table shows:
  • Much higher (absolute) strain rate values by velocity gradient than by speckle tracking and segmental strain (which are more comparable). This may thus be a function of noise. Noise is largely eliminated by integration to strain (in fact, the integration is a temporal smoothing), and thus strain values are far more comparable; see the sketch after this list.
  • Not tracking the apical segment results in lower values here, probably due to an angle problem.
  • Lower values than in the previous strain / strain rate study with tissue Doppler. In this case, the rejection of segments is driven by the segment method, as only segments feasible in all methods were included.
  • No base to apex gradient in the 2D strain values seen here. In this analysis, care was taken to achieve as little apical curvature as possible.
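A minimal simulation of the first point (idealised curves and an assumed noise level, purely for illustration): random noise inflates the peak of the strain rate curve, while the integral to strain is much less affected:

import numpy as np

dt = 0.01                                   # s per frame (100 FPS, an assumed frame rate)
t = np.arange(0, 0.35, dt)                  # an idealised systole of 350 ms
true_sr = -1.0 * np.sin(np.pi * t / 0.35)   # smooth strain rate curve, peak -1.0 s-1
noisy_sr = true_sr + np.random.default_rng(3).normal(0, 0.4, t.size)

# Peak strain rate is inflated by the noise...
print(f"peak strain rate: true {true_sr.min():.2f}, noisy {noisy_sr.min():.2f}")

# ...but cumulative integration to strain acts as temporal smoothing,
# so zero-mean noise largely cancels (strain here in % points, simplified)
strain_true = np.cumsum(true_sr) * dt * 100
strain_noisy = np.cumsum(noisy_sr) * dt * 100
print(f"end systolic strain: true {strain_true[-1]:.1f}%, noisy {strain_noisy[-1]:.1f}%")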


And, of course, reproducibility of the strain rate and strain measurements should be part of it:

It must be emphasized that while global measurements are averages over the whole ventricle, and thus more robust and reproducible, segmental measures are from one segment only, and thus have a higher variability. (Unless one applies smoothing, of course, but smoothing is indiscriminate, and will smooth away real differences just as much as differences due to method variability.) We compared the reproducibility of single segment measurements and global averages (154):


Method | Measurement | Mean | CoR | Mean error (% of mean)
Segmental (ST + TDI) | Global strain (% points) | -19 | 2 | 4
Segmental (ST + TDI) | Global strain rate (s-1) | -1.2 | 0.2 | 8
Segmental (ST + TDI) | Segmental strain (% points) | -19 | 8 | 18
Segmental (ST + TDI) | Segmental strain rate (s-1) | -1.2 | 0.5 | 16
2D strain | Global strain (% points) | -21 | 2 | 6
2D strain | Global strain rate (s-1) | -1.1 | 0.2 | 10
2D strain | Segmental strain (% points) | -21 | 7 | 14
2D strain | Segmental strain rate (s-1) | -1.1 | 0.5 | 17
Mean: mean of repeated measurements. CoR: coefficient of repetition (= 2 SD of the difference between repeated measurements). Mean error: difference between repeated measurements in % of the mean. The variability of segmental values was 2 - 3 times higher than that of global values (p<0.001), but for segmental measures there was no difference between 2D strain and segmental strain by combined ST and TDI.

This, however, is a smaller phase II study of normal subjects, and thus also over a limited range, although it was part of the larger HUNT phase III study.

The main point so far is that there are large problems in doing meta analyses of strain and strain rate measurements, as both the technology and the analysis technique vary, although other echo measures will be more standardised (326). So far, it seems that standardisation should be the first step (287), although limiting the range of methods may be a bad idea at present.

Also, it means that peak values do not have validity across different technological platforms, and the most universal (but less documented) approach is the qualitative assessment of colour and curve forms, as discussed above.


Global function measures:


Longitudinal shortening has long been established as one of the best measures of global function.

Ejection fraction is limited in concentric geometry, where it may be normal even in severe heart failure, and wall motion score index (WMSI) is limited by only working where there are regional inequalities in function. There it functions as a global measure, having additional information over EF (189). However, in symmetrically dilated cardiomyopathy the WMSI will be 2 irrespective of the EF, and here it does not function.

Long axis shortening of the left ventricle, on the other hand, will be useful in all situations, both diagnostically and prognostically (30 - 37, 40, 56, 59, 60, 64 - 67, 116, 150, 190, 191, 192, 193, 204). Systolic shortening rate has likewise been well established as a global systolic measure, measured by the mean peak systolic velocity of the mitral annular plane (37, 38, 39, 40, 202, 203, 204).

Global strain (peak or end systolic) and (peak systolic) strain rate can be measured either as an average of segmental values, as a normalisation of global values (from annulus measurements, in various ways), or directly by composite speckle tracking methods. In speckle tracking, there is often some kind of spline (or similar) smoothing along the ROI, often making use of the annular plane motion as a major input. Even though this might mask regional inequalities, it may make the global measure more robust. However, values differ between vendors (277, 278, 327, 328), and the main source of variability seems to be the analysis software. Some of the reasons for this are explained here. Also, the use of peak vs. end systolic strain may cause some variability, as will the definition of end systole by the analysis software. There is still a need for standardisation (287).

Global strain rate has been less used, probably because of the variability of segmental peak systolic strain rate.

Thus, despite the emerging evidence for the diagnostic and prognostic value of global strain, it must be emphasized:

  1. The evidence for the value of global strain over MAPSE is not well established, except for children. It may be that the variability due to disease is larger than the variability due to differences in body size.
  2. The vendor dependency of measurement values means that normal values are only valid within one system, and that follow up of single patients should be restricted to one system.
  3. Serial measures of MAPSE or S' in single patients are thus just as little prone to variability, and the need for normalising for size is then absent (again excepting growing children).

Thus, guidelines recommending the use of global strain are still somewhat removed from the real world, except for serial measurements on a single system.

Still, global strain shows both diagnostic and prognostic value. This compensates for the shortcomings of ejection fraction, and it is more sensitive (149, 150, 159), even in coronary disease with normal to dilated ventricles.

Wall motion score index (WMSI), being the average of the wall motion scores of all evaluable segments, becomes a measure of global function, and has been shown to correlate with EF in infarcted ventricles (40). For global measures, the main emphasis in publications has been on global strain, especially by the 2D strain application. Basically, global functional measurements should be expected to show reduced function with increased infarct size. This has been shown for WMSI as well as EF, and also that they correlate (40). However, global strain may be obtained by any method by averaging the segmental values, which also gives reduced variability, as shown by Thorstensen et al. (154); there was no difference between 2D strain and segmental strain by combined ST and TDI for global measures. With MR, there is a reliable method for quantitating infarct size independently, enzyme markers being unreliable, relating as much to the degree and timing of reperfusion as to infarct size.

This added value has been less well documented. From a puristic point of view, only clear documentation of added clinical value over B-mode based WMS is evidence for the utility of deformation imaging. This will be reviewed in more detail below.



If new methods prove diagnostically better than established methods, this is proof enough of added value. If this is not established, for instance in prognosis, the value of adding a method to others may be shown statistically, as for instance done in (133).





In one study (188), there was an improvement of WMS in the infarcted segments from 2.7 to 2.2 (decimals due to averaging), while peak systolic strain rate improved from -0.24 s-1 to -1.2 s-1 in the segments with severely reduced function, and from -0.6 s-1 to -1.1 s-1 in the segments with moderately reduced function, showing strain rate to be a help in assessing recovery of function compared to WMSI.



Normal limits versus cut off values

When using numerical values, it is fundamental to understand that there may be a difference between the cut off values between normal and abnormal established in studies, and the normal limits in a general healthy population, as illustrated below.



This is relevant both in clinical practice and in studies. Given the high number of artefacts found in daily echo practice, studies reporting a very high feasibility may be prone to this effect. This means not only that such studies may overestimate the accuracy of deformation imaging by just confirming the visual assessment, but also that the wall motion may in fact be the main source of information. However, studies reporting added information, or increased accuracy relative to B-mode, do show the added diagnostic value, as has been demonstrated in some studies (128, 133). But basically, a high discard rate ensures higher quality of the studies.

In diagnosis of regional dysfunction, however, the added value of deformation measurement has been less well documented.  This also needs to be established in clinical studies.

In this last part, I will try to review the evidence from clinical trials for the different areas of application.

The clinical evidence may be divided into phases, analogous with the phases of clinical trials.




However, in interventional research in cardiology there is almost always an element of imaging, and the requirements of modern imaging research regarding phase III (and IV) studies should be taken into account in the study design. Only by adding to the total database can newer methods with better predictive value be introduced into clinical practice, and the predictive value of echocardiography improved. A glaring example is EF: there are far better predictors today, but none linked so massively to intervention. This results in EF still being used as an indication for intervention, despite its shortcomings.

Any new method arriving at phase II should always be compared to older methods. The added value of a new method basically depends on better accuracy (phase II) or better predictive value (phase III). As discussed above, even if a method is only as good as older methods, it may be an alternative in selected patients, as well as part of the total information in an integrated evaluation.




The causes of this are various: there may be a publication bias in favor of newer methods, the acquisitions are often optimised in favor of the newer methods, and the patients selected for good image quality with the newer methods. A classical example is the studies of stress echocardiography, where harmonic imaging increased the sensitivity of stress echo compared to fundamental imaging (104, 105), but the sensitivity of fundamental imaging decreased compared to an earlier study by the same group (103). In this case, the reasonable explanation is that with harmonic imaging, more patients with moderate image quality become eligible for stress echo, while yielding poor results with fundamental imaging. Thus, even if new methods are shown to be superior, this may not be the overall result in the daily clinic. Clinical studies should also be viewed with this in mind.

Global functional systolic measurements

Ischemic heart disease is the main cause of regional dysfunction, and thus the main target of deformation imaging; however, deformation has also led to the concept of normalised measures for global function, which may become increasingly important in the future.

Global functional measurements
Longitudinal annulus displacement and velocity, as well as global strain, have all been shown to be better discriminators and prognosticators than other measurements of global function (FS and EF). The annular displacement has been shown to be more sensitive than EF in predicting events in heart failure (36, 192) and hypertension (193), indicating that it is a more precise measure of systolic function than the cavity measurements. This may be due to the shortcoming of EF in small ventricles / hypertrophy. It has also been shown to be a better correlate of infarct size than EF (150). Also, the MAE correlates better with BNP in heart failure than does fractional shortening (204).

Thus, the MAE is a more all round useful measure of longitudinal function than EF.

In fact, the main reason for using EF at the present day, instead of the better long axis measures, is the weight of evidence. And alas, interventional studies using echocardiography as a secondary outcome persist in using only EF, instead of including newer measures for direct comparison of the ability to predict clinical outcome, as well as for establishing cut off values for intervention.

The systolic peak annulus velocity S' has been shown to be sensitive to reduced function in mutation positive relatives of patients with manifest hypertrophic cardiomyopathy, despite normal EF and no hypertrophy (203). Diastolic function by tissue Doppler was similarly decreased. It also correlates better with BNP in heart failure than does fractional shortening (204).

Thus, the peak systolic annular velocity is useful in that it is a better marker of systolic function, and that it offers a measure that allows direct comparison of systolic and diastolic function.

In addition, global strain can be seen as a normalisation of the absolute longitudinal shortening for ventricular length. This is credible: some of the variability in MAE will be due to differences in LV size, and normalising will remove this variability and give a tighter relation to pumping parameters normalised for body size, and thus a higher diagnostic discriminatory value. This probably matters most where normal variation in body and heart size is biggest (as in children), and least where normal variability is lower and the difference between normal and pathological is great (as in dilated heart failure). None of the methods for normalisation, however, has established superiority. Global strain, in one form or another, is likely to take flight in the near future.

Normalisation of velocities seems less established; systolic velocities are related to diastolic velocities.



Ischemic heart disease


Myocardial infarction.

Ischemic heart disease is the main cause of regional dysfunction, and was the first area for research. In fact, the B-mode wall motion score index is the first assessment of regional function:

  1. Normal
  2. Hypokinetic
  3. Akinetic
  4. Dyskinetic
However, when analysing wall motion by B-mode, the frame rate is higher than the temporal resolution of visual inspection (24). In order to achieve optimal temporal resolution, it is customary to stop the loop and scroll through it frame by frame. This usually shows segmental dyssynergy as a failure to thicken during the first frames of systole. But this again means that wall motion scoring is more dependent on early failure to thicken/shorten, i.e. the timing of onset of contraction, than on peak thickening/shortening (strain) or peak thickening/shortening rate (strain rate). This means that the methods in fact do not measure exactly the same thing.


Global measurements



Global strain was shown to correlate well with infarct size (R = 0.84), as compared to WMSI (R = -0.71) and EF by echocardiography (R = 0.58) (205). The diagnostic accuracy for infarcts ≥ 30 g by ROC analysis was an AUC of 0.95 for global strain (giving a sensitivity of 0.83 and a specificity of 0.93), 0.90 for WMSI and 0.81 for EF. The study is hampered by not giving confidence intervals, thus not showing whether the differences between methods were significant. Another study by the same group did show a trend towards improved accuracy in infarct diagnosis by global strain, compared to MAE, EF and WMSI (150), but again without confidence intervals. Also, normalising for infarct size did not seem to add to the value of MAE in adults, perhaps because the range in LV size is small.

2D strain has been applied to early echo in NSTEMI. It has been shown that global longitudinal strain correlates with final infarct size (189), and was a far better predictor than EF, but so far no better than WMSI.

Regional measurements




Thus, averaging more than one segment reduces variability. However, as discussed in the main section, segmental values cannot be averaged for a whole wall, due to the segment and AV plane interaction. There is, however, the possibility of averaging segments within a vascular territory, as discussed below.

Unless there is definite proof that quantitative measurements are better than WMS, the alternative to quantitative measurements is simply to use the parametric WMS from strain rate; this reduces the information content, but also the variability. In addition, it makes it possible to measure the timing of events, such as time to segmental onset of shortening, as well as time to onset of lengthening (which is an indication of post systolic shortening) (186).

Wall motion scoring by colour SRI:




Using curved M-mode colour display for wall motion scoring, WMS by colour strain rate imaging was shown to have fair correspondence with B-mode (6): kappa 0.45, weighted kappa of 0.63 to 0.64 (7). For repeated measurements, both inter- and intra observer, the weighted kappa was of the same order of magnitude, as it also was for WMS by B-mode. It was also shown to have similar accuracy in diagnosing regional coronary disease (7, 10), with a sensitivity of about 70%, a specificity of 90%, and an overall accuracy of about 84%. Interestingly, adding the two methods did not change the overall accuracy, indicating that the methods gave the same information (7). (NB: strain rate was evaluated unblinded to the B-mode loops.) Another study (10) concluded that both strain rate and strain could describe regional dysfunction in infarction, but the variability in this study is about the same as in the others, so the overlap in values between segments with different WMS is too great for the method to be clinically useful in the individual patient, although the numbers needed for significance in those studies were quite low (10 - 25 patients). Strain by ultrasound showed fair correspondence with strain by MR in another clinical validation study (9), with no significant bias, but with limits of agreement of about ± 7%, as compared to a normal strain of 18% in controls and 15% in remote segments in infarction patients. In this study, mean strain in infarct segments was 1 - 2%, showing that only akinetic segments were considered, and with a repetition coefficient of 7%, hypokinesia may be difficult to separate from normokinesia. A comparative study (40) of ring motion by M-mode and tissue Doppler vs. segmental analysis by peak systolic strain rate showed that neither ring velocity nor displacement could identify the infarct site in terms of the myocardial sector affected, while segmental analysis by strain rate could. However, this too was only by significance for group data. The interesting point was that the mean strain rate of a sector could not identify the infarct site either, although segmental strain rate could, showing that infarct distribution is not limited to discrete sectors.
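For reference, kappa and weighted kappa of the type reported above can be computed as follows (the scores here are invented for illustration, not data from the cited studies):

from sklearn.metrics import cohen_kappa_score

# Hypothetical wall motion scores (1 = normal ... 4 = dyskinetic) for the same
# 15 segments, scored by B-mode and by colour SRI (invented data)
wms_bmode = [1, 1, 2, 3, 1, 2, 4, 1, 3, 2, 1, 1, 2, 2, 3]
wms_sri   = [1, 2, 2, 3, 1, 1, 3, 1, 3, 2, 1, 1, 2, 3, 3]

kappa = cohen_kappa_score(wms_bmode, wms_sri)                      # agreement beyond chance
wkappa = cohen_kappa_score(wms_bmode, wms_sri, weights="linear")   # near misses penalised less
print(f"kappa {kappa:.2f}, weighted kappa {wkappa:.2f}")

Weighted kappa is the natural choice for ordinal scores like WMS, since a disagreement between "normal" and "hypokinetic" is less serious than between "normal" and "dyskinetic".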

In conclusion, parametric strain rate imaging seems to have a sensitivity and specificity comparable to grey scale imaging (about 85%), both in locating infarct segments and in semi quantitative analysis of wall motion. Repeatability also seems to be of the same magnitude. In quantitative analysis, careful post processing may give sufficient precision for clinical work. One of the regions that is difficult to assess visually is the basal inferior wall, as this region may have a lot of motion, but little actual contraction, as shown above.

Quantitative measurements

Quantitative measurements have shown better ability to discern infarcted from non infarcted segments than velocities (40, 41). A study comparing segmental velocities to segmental strain rate (41) concludes that peak systolic strain rate is superior to segmental peak systolic velocities in identifying infarcted segments, with M-SPECT fixed perfusion defects as reference. Sensitivity and specificity for recognition of infarct segments were 91% and 84% for colour SRI, 63% and 73% for colour DTI, 78% and 71% for B-mode echocardiography (WMS), and 87% and 77% for anatomic M-mode (AMM), respectively. Repeatability of evaluation of infarcted vs. non infarcted segments was 0.85 with both colour DTI and colour SRI, but with higher sensitivity and specificity of SRI. Colour analysis was considered feasible in 100% of segments. The results are similar to previous studies. In quantitative analysis, peak SR was measurable in 84% of segments, while peak segmental velocity was feasible in 91%. Peak SRs correlated with wall motion assessment by B-mode echocardiography better than peak velocities (R = 0.66 vs. 0.10), with less overlap between groups, but the study still showed overlap between peak systolic strain rates in segments grouped by grey scale WMS. The variation (SD of differences) was reported as 6 - 10%, or 0.04 to 0.06 s-1. This corresponds to a repetition coefficient of 0.10 s-1, which is quite acceptable. This study was done by averaging measured values from three cycles. SRI would thus seem to add information to WMS.

Basically, segmental reduced function will not cause the ring to lag in part of the circumference, so much as the total ring motion will be reduced as a function of the reduced total shortening force (40). This may explain why the global strain is just as useful as regional strain in assessing the infarct size, due to the segment interaction and the interaction with the AV-plane.

The acute phase of myocardial infarction may be considered an acute ischemic event. However, several studies have addressed the presence of acute ischemia in other settings. Kukulski et al. (99) did a study during PCI, demonstrating a reduction in peak systolic velocities, strain rate and strain in both the longitudinal (LAD occlusion) and transmural (RCA/CX occlusion) direction. SR and strain had the highest sensitivity / specificity (75% / 80% and 80%, respectively), compared to 68% / 65% for velocity, in identifying ischemia. In ROC analysis, the AUC was 0.62 for reduction in systolic velocities, 0.84 for strain rate and 0.82 for strain. The main implication of the study is that it demonstrates the difference in sensitivity of deformation vs. motion imaging, due to tethering effects. As strain rate is noisier than velocity, the repetition coefficient may well be substantially higher, and the clinical value similar.

Quantitative measurements have also shown the ability to quantitate changes in regional function during the recovery phase of an acute infarction, showing that there is a rapid recovery already during the first 1 - 3 days, and less during the first week (92, 174, 188). One example is shown above. In the last study, there was little reduction (compared to normal) in global indices (including annular plane parameters and global SR and strain) the first day, and subsequently little improvement during the first week. Infarct related segments, however, had a WMS of 2.7 the first day, improving to 2.4 on the second and third, and 2.2 on the seventh (decimals due to averaging).

Contrary to this, strain rate and strain improved most on day 2, less from day 2 to 7.



 | Day 1 | Day 2 | Day 7
Mean WMS in infarct related segments | 2.7 (0.4) | 2.4 (0.7) | 2.2 (0.7)
Strain rate (s-1), severely depressed segments | -0.24 (0.2) | -0.92 (0.5) | -1.2 (0.4)
Strain (%), severely depressed segments | -1.4 (1.7) | -11.6 (5.5) | -14.7 (6.5)
Strain rate (s-1), moderately depressed segments | -0.6 (0.06) | -1.0 (0.3) | -1.1 (0.4)
Strain (%), moderately depressed segments | -7.7 (1.1) | -14.7 (4.1) | -15.1 (1.1)
Standard deviations in parentheses.

From this, it is evident that peak strain rate and strain were nearly normalised during the first week, but not WMS. The authors argue that this shows strain and strain rate to be more sensitive to changes than WMS. This is true insofar as the main purpose is to differentiate between stunning and necrosis. However, this need not mean that WMS is less sensitive; as argued above, WMS as analysed in early systole may reflect timing more than peak thickening, and the time course of recovery may be different between delayed onset and peak contraction. But this difference will make peak systolic deformation the earliest predictor of functional recovery, and it may be more useful in early assessment.




The presence of post systolic shortening (PSS) in acute myocardial infarction was observed by Jamal et al. (91) and might represent another diagnostic criterion. This was addressed in a longitudinal study (92) showing the presence of post systolic shortening in 60% of infarct segments (73% of mid infarct segments, but in all patients), 29% of the border zone segments and 5% of presumed non infarct segments. The finding that the area of PSS exceeds the area of hypokinesia was also observed in a study of 3D parametric imaging of myocardial infarction (22). PSS disappeared in virtually all border segments within one week, and in half the infarct segments after 3 months. Thus PSS has neither the sensitivity nor the specificity to identify infarcted segments, and the presumed ischemic border zone also shows PSS, but it may be important in identifying acute ischemia, and in identifying infarct segments in combination with peak strain rate / strain. Post systolic shortening has been shown to be present in 30% of normal segments, but in those cases always in combination with normal systolic strain (97). The best cut off between normal and pathological PSS was considered to be post systolic strain > 2.5% absolute, or 20% of total strain. In patients with acute ischemia, PSS was present in 78% of ischemic segments and 40% of non ischemic segments; in scarred segments the percentages were about the same. The last finding contrasts with another study, where PSS was reduced both in magnitude and extent from the acute (1 day) to the chronic (3 months) phase of myocardial infarction (92).

In the study of Kukulski et al (99), the AUC for post systolic shortening was 0.67, 0.80 and 0.85 for increases in post systolic velocity, strain rate and strain, respectively, demonstrating the diagnostic value of post systolic shortening in ischemia. The study also showed reversal of post systolic shortening after reperfusion. On the other hand, reproducibility data are not given. The other main point of the study is that post systolic shortening is established as an important marker of acute ischemia in a clinical setting, being present within very few seconds. As the duration of ischemia during PCI is short, however, the reversibility may not be the same after prolonged ischemia (stunning) or myocardial infarction (92, 97). The clinical setting was such that the study has more bearing on the method and the pathophysiology than on actual clinical utility.

This is further developed in another paper by the same group (100), where the post systolic strain index is defined as PSI = (peak strain - end systolic strain) / peak strain. As ischemia induces a reduction in systolic strain as well as an increase in post systolic strain, the combined index was shown to be more sensitive: AUC of 0.95, with a cut off value of 0.25 giving a sensitivity and specificity of 89%, as compared to an AUC of 0.84 for end systolic strain alone (cut off -10%, sensitivity/specificity 86/83%). Repeatability is not given. Newer studies have found lower accuracy of the PSI (205). In stress echo, the post systolic index has not been shown unequivocally to give added diagnostic value (128).
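A minimal sketch of how the index can be computed from a sampled strain curve; the curve and end systole timing below are made up for illustration, and in practice end systole would be defined from aortic valve closure:

```python
import numpy as np

def psi(strain, es_index):
    # strain: sampled strain curve over one cycle (negative = shortening)
    # es_index: sample index of end systole (aortic valve closure)
    peak = strain.min()                       # peak (most negative) strain, any time
    return (peak - strain[es_index]) / peak   # PSI as defined in (100)

t = np.linspace(0.0, 1.0, 100)
strain = -12.0 * np.sin(np.pi * t) ** 2       # toy curve peaking after end systole
print(round(psi(strain, es_index=35), 2))     # about 0.2, i.e. below the 0.25 cut off
```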

However, the concept of post systolic shortening is basically an expression of inequalities in relative systolic load, and an indirect measure of delayed contraction.

Longitudinal strain and strain rate by tissue Doppler have been shown to correlate with the transmurality of infarction in myocardial segments (210, 220), with late enhancement MR as reference. The difference was significant for strain in segments with > 25% scar transmurality, and for strain rate with > 50% transmurality. However, overlap was present, and sensitivity and specificity data are not given. It is also doubtful whether the fine division of transmurality into four categories is useful, despite the findings of Kim (211). Longitudinal strain and strain rate are still functional measurements, which may be more important than the anatomical information given by MR. The finding of reduced strain with increasing transmurality has been repeated, although the diagnostic capability was no better than WMS (205). The difference in longitudinal strain between subendocardial and transmural infarcts by 2D strain was not shown in another study (221), so there may still be a difference in sensitivity between the methods.

Analysing strain in terms of vascular territories, i.e. averaging the segmental values over a classical vascular territory, reduced the confidence intervals, showing improved reproducibility by averaging segments. This is in line with the findings above, that reproducibility of segmental values is far poorer than that of global averages. This reduction in variability also improved accuracy (whether significantly is not stated), both for strain and for WMSI (205). In that study, however, the newer view of overlapping vascular territories (146), as shown above, was not taken into account, which may limit the availability of segments to average, especially in the circumflex territory.
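The statistics behind this are straightforward: if the segmental measurement error has standard deviation σ, and the errors of the n averaged segments have mean correlation ρ (an assumption; segmental errors are rarely fully independent), then

```latex
\mathrm{SD}(\bar{\varepsilon}) = \sigma \sqrt{\frac{1 + (n-1)\rho}{n}}
```

which reduces to σ/√n for independent errors, but approaches σ when the errors are strongly correlated; averaging thus helps most when the segmental errors behave as independent noise.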

2D strain

2D strain has also been shown to give quantitative information about the degree of myocardial loss. Global strain correlates well with infarct size, as discussed above.

A recent study has compared tissue Doppler derived longitudinal strain with speckle tracking derived 2D strain in the diagnosis of myocardial infarction (213). Strain showed a significant improvement from the acute phase to discharge; however, this finding was significant only for tissue Doppler derived strain, while neither longitudinal nor circumferential 2D strain reached significance. This may indicate a lower sensitivity of the latter method, possibly related to the smoothing issue.

Both methods were able to discern between non infarcted, sub endocardially infarcted (1 - 50% scar) and transmurally infarcted (51 - 100% scar) segments, as was circumferential 2D strain. The AUC was the same (0.75) for longitudinal Doppler and 2D strain, while circumferential 2D strain had an AUC of 0.85; again, it is not clear whether this difference is significant. Global Doppler and 2D strain probably showed no significant difference in the ability to discriminate large (> 20% of infarct mass) from small infarcts (AUC 0.85 vs 0.88).

However, when looking separately at apical and inferior infarcts, longitudinal Doppler strain seemed to be the only longitudinal method separating non transmural from transmurally infarcted segments, while the results for longitudinal 2D strain did not reach significance. For inferior infarcts, the separation between normal and sub endocardially infarcted segments was borderline significant by both methods; in apical infarcts, 2D strain was not significant. This may be due to the increased curvature dependency in the apex; look at the small apical infarct in the main section. Another study comparing longitudinal and circumferential strain by 2DS did not find differences in longitudinal strain between sub endocardial and transmural infarcts (221). This may not reflect a real difference, but simply express the difference in sensitivity between 2DS and tissue Doppler.

Reproducibility was good by both methods, slightly better by 2D strain, which is to be expected, as only the speckle tracking data were processed by the smoothed 2DS method.

In the setting of acute myocardial infarction, the STEMI is readily identified, and should preferably be selected for immediate invasive treatment. In NSTEMI, there is a substantial proportion of patients with an occluded infarct related artery (IRA), who might hypothetically profit from the same immediate invasive treatment; identifying these patients might be useful. 2D strain has been applied to early echo in NSTEMI. Global longitudinal strain has been shown to correlate with final infarct size (189), and was a far better predictor than EF, but no better than WMSI. The extent of the area at risk, in terms of the number of dysfunctional segments, has been shown to be related to the presence of an occluded IRA (218). The number of dysfunctional segments was larger by longitudinal strain than by WMS, presumably because of better sensitivity, adding to the emerging indications that quantitative deformation measurements may add information. However, this was not confirmed in another study by the same group (219), where longitudinal strain by vascular territories was no better than WMS in predicting occlusion, while circumferential territorial strain by 2DS was better. Thus, this question may still be considered somewhat unanswered.

Transmural and circumferential strain:

Transmural and circumferential strain must be analysed in cross sectional views, and full analysis of all segments is only possible by speckle tracking. Tissue Doppler may give transmural strain in the anterior and inferior segments (where the beam crosses the wall), and circumferential strain in the lateral and medial segments (where the beam runs tangentially).

In a study by Becker et al (212), 2D strain was used to measure transmural and circumferential strain in short axis views. They did show a reduction in transmural and circumferential strain with increasing transmurality, but there was still 11% transmural and 8% circumferential strain in transmural infarcts (50 - 100% transmurality), and with standard deviations of 5 - 7% for transmural and 8 - 10% for circumferential strain, the overlap was considerable. All differences were significant, but the sensitivity and specificity for separating non transmural (1 - 50% scar) from transmural (50 - 100% scar) infarcts was about 70% for both strain components, and even slightly lower for separating non infarcted from non transmurally infarcted segments. The authors argue that the method is less noisy than tissue Doppler, without acknowledging that this is due to the degree of smoothing; as shown above, the tissue Doppler curves become just as smooth when processed with the same application. Also, as seen above, transmural strain, and hence circumferential strain, is highly processing dependent. They also describe that strain in completely transmurally infarcted segments was not zero, and ascribe this to tethering, without reference to the processing issues of the 2D strain application: spline smoothing evens values between segments, and the ROI width determines the strain value. The dependency of transmural strain on ROI width may also be a factor in this finding.

In the study by Chan et al (221), circumferential 2D strain was able to discern between transmural and non transmural infarcts, while longitudinal 2D strain was not, despite previous findings that longitudinal strain is proportional to infarct transmurality, both by TDI (210, 220) and by 2D strain (205). Some of the effect may be method specific:
  1. 2DS may have lower sensitivity for reduced longitudinal function due to smoothing. This seemed to be the case in the study by Sjøli et al (213) as well, where longitudinal and circumferential 2DS were compared: the difference between subendocardial and transmural infarcts was significant for longitudinal strain by tissue Doppler, but only for circumferential strain by 2DS.
  2. The ROI may not reflect the real wall thickness; circumferential strain is dependent on ROI thickness, and the reduced strain may thus not be apparent in subendocardial infarcts (see the relation sketched below).
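A rough way to see the coupling between the components (a simplification, assuming incompressible myocardium and that the longitudinal, circumferential and transmural directions are principal strain directions, with strains as fractions rather than %):

```latex
(1 + \varepsilon_L)(1 + \varepsilon_C)(1 + \varepsilon_T) = 1
```

If the ROI is wider than the true wall, non deforming tissue is included and the measured transmural strain is diluted; the same ROI will then also misrepresent the circumferential shortening of the actual wall.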

Studying NSTEMI, one study (219) found that longitudinal strain by vascular territories was no better than WMS in predicting occlusion, while circumferential territorial strain by 2DS was better, despite previous findings by the same group (218).










Stress echocardiography:






The interpretation of stress echocardiography depends on subjective assessment of wall thickening (sometimes substituted by wall motion, meaning endocardial excursion, but this may be less specific for preserved function, as segments may move by tethering). This is subjective, and provides only semi quantitative data. It has been shown to be extremely experience dependent: trained echocardiographers with no specific training in stress echo have a sensitivity of only 65%, i.e. no better than exercise ECG, while expert stress echocardiographers have about 85% to 90%, comparable to myocardial SPECT perfusion imaging (101), as illustrated below.




Furthermore, it has been shown that visual assessment has poor temporal resolution (usually about 100 ms, with training down to 80 ms), and therefore limited ability to detect more subtle changes in myocardial function (102), although this can be compensated by increased frame rate and lower replay rate, a point not raised in the study (for instance, at 50 frames per second, events 20 ms apart are separated by 80 ms of viewing time when replayed at quarter speed). Inter institutional reproducibility has been shown to be low: a study from 1996 (103) showed a kappa coefficient of 0.37, sensitivity of 76% and specificity of 87%. Introducing second harmonic imaging increased the agreement to a kappa of 0.69 within institutions (104) and 0.55 between institutions (105). The sensitivity was 92%, substantially better than in the study from 1996, and at the same level as reported in other studies (101). Fundamental imaging, however, showed a decrease in sensitivity compared to 1996. This illustrates a general principle: whenever a new method becomes available, the apparent accuracy of older methods decreases. With fundamental imaging alone, more patients may be classified as non-echogenic, indicating that with harmonic imaging more patients became eligible for stress echo at sufficient diagnostic accuracy.

Myocardial velocities

Still, the method remains experience dependent and semi quantitative. Tissue Doppler holds the promise of increased temporal resolution as well as quantitative and objective measurement. Peak systolic velocity is a robust measurement, and closely related to contractility. Peak segmental systolic velocity during DSE was shown to be reduced in segments with reduced wall motion score and in segments supplied by a stenosed artery (106). This was further elucidated in a study comparing patients and normal subjects (107). Feasibility was 92% of segments, normal values were established in the normal group, and cut offs were set to give a specificity of 80%. The normal dobutamine response was defined for each segment, derived from normal subjects, patients with a normal 2D dobutamine response and patients with normal coronary angiography. The study measured all feasible segments at the basal and midwall levels. The sensitivity and specificity of systolic velocities for affected vascular territories were 83% and 72%, vs. 88% and 81% by wall motion scoring. Limits of agreement were 0.2 cm/s for inter observer measurements, and concordance was 86%. Analysis was not feasible in the apex, due to the low velocities and poor depth resolution in the near field. Thus, systolic velocities seem to give results comparable to, but not better than, wall motion scoring. However, the diagnostic accuracy by tissue Doppler was similar for novice interpreters (76%) and expert echocardiographers (74%), and only slightly lower than wall motion scoring for expert stress echocardiographers; the corresponding accuracies by wall motion scoring were 68, 71 and 88%, respectively (108).



Velocities were measured in the middle of each segment. Of the 77 patients investigated, 55 had significant coronary artery disease: 19 (25%) had 1-vessel disease, 17 (22%) had 2-vessel disease and 19 (25%) had 3-vessel disease. Of all the patients studied, 40 (52%) had disease of the left anterior descending artery, 33 (43%) involvement of the left circumflex artery, and 37 (48%) involvement of the right coronary artery. The criterion for a positive test by tissue velocity (one or more segments, and how far below the cut off limit) is not reported, but all twelve midwall and basal segments were analysed.

Another study, the multi centre MYDISE study, reported a similar feasibility but slightly poorer reproducibility of the segmental velocities (109), with coefficients of variation of 11 - 18% for peak systolic velocity at peak stress in basal segments, 14 - 28% in mid segments and 29 - 69% in apical segments. This study also concluded that the apical velocities are too low to give reproducible results. In this feasibility study, 10 normal studies were analysed by nine different observers, and feasibility was reported to be 90% of midwall and basal segments in 92 normal subjects. In the second part of the study (110), the diagnostic value was addressed in 289 patients. Cut off values were established by ROC analysis in the 92 normal subjects from the previous study and in 48 patients with known coronary artery disease. Sensitivity and specificity were then studied prospectively in 149 unselected patients referred for chest pain, with coronary angiography (> 50% stenosis) as reference. This group included 59 normal patients, 36 (24%) with single vessel, 27 (18%) with double vessel and 27 (18%) with triple vessel disease.


Peak systolic velocity at peak stress, rather than change in velocity from baseline, was the best discriminator of disease, but sensitivity was only 63 - 69% and specificity 60 - 67% for the different vascular regions, somewhat lower than reported by the Brisbane group, with cut off values of 10 - 12 cm/s in the basal segments. However, when a regression model including age, gender and peak heart rate was applied, sensitivity increased to 80 - 93% and specificity to 80 - 82%. These results imply that not only heart rate, but also age and gender should be taken into account when interpreting stress echo by tissue Doppler.
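The kind of adjustment described can be sketched as below. The variables and data are simulated and hypothetical; this is merely the type of regression model the study reports, not the MYDISE model itself:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.normal(10, 2, n),      # peak systolic velocity at peak stress (cm/s)
    rng.normal(60, 10, n),     # age (years)
    rng.integers(0, 2, n),     # gender (0/1)
    rng.normal(130, 15, n),    # peak heart rate (bpm)
])
# toy outcome: low velocity relative to what age and heart rate would predict
y = (X[:, 0] - 0.05 * X[:, 3] + 0.03 * X[:, 1]
     < 5.3 + rng.normal(0, 1, n)).astype(int)

model = LogisticRegression().fit(X, y)
risk = model.predict_proba(X)[:, 1]   # disease probability, adjusted for covariates
```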

The differences in cut off values between the two studies can partly be explained by the fact that segmental velocities in Brisbane were measured in the mid segment, in MYDISE at the base of the segment. As velocities increase from apex to base, normal segmental velocities (and hence cut off values) will be higher in the MYDISE study. The difference in sensitivity for peak velocity alone may partly be explained by the number of segments analysed. In the MYDISE study, only 7 segments were analysed, and it seems that positivity was defined by segmental velocities being reduced only in the specific vascular areas (LAD: BA and MS; Cx: BL and BP; RCA: BI, MI and BS). If so, the sensitivity may be substantially reduced.
Frame rates are not reported in either study, but tended to be lower than what is customary at present (especially in the MYDISE study). This might result in some undersampling, so the cut off values might be higher with higher frame rates. So far, no studies have addressed the timing of motion by tissue velocities as an additional variable; peak velocities do not capture the asynchrony induced by the delayed onset and post systolic shortening that is a marker of ischemia. Just looking at the timing of peak velocities when there is a suspicion of asynchrony will often answer this, as illustrated below.
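The undersampling point is easily demonstrated: sampling a velocity transient at a low frame rate will usually miss the true peak, biasing measured peak velocity downwards. A toy example with a made-up Gaussian velocity pulse:

```python
import numpy as np

peak_v, t_peak, width = 8.0, 0.157, 0.03   # hypothetical pulse: 8 cm/s true peak

def v(t):
    # smooth "true" velocity curve (cm/s) as a function of time (s)
    return peak_v * np.exp(-((t - t_peak) / width) ** 2)

for fps in (150, 40):                       # high vs low frame rate
    t_frames = np.arange(0.0, 0.4, 1.0 / fps)
    print(fps, "fps -> measured peak:", round(v(t_frames).max(), 2), "cm/s")
```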

Even though peak velocities show accuracy comparable to wall motion score, at least for detection of ischemia, tethering makes the true location of ischemia difficult to determine. That may be part of the problem in the MYDISE study as well, analysing only typical segments and considering them positive only for stenoses in the vessels of the corresponding vascular territories.

It has previously been shown that the segmental specificity of velocities is low (40). Thus, reducing the number of segments necessary for a positive test will reduce the specificity. Basically, using velocities in stress echo should be considered a screening for an ischemic response; the actual location of the ischemic areas should hypothetically be shown better by strain rate and strain.

Strain rate imaging

Feasibility of strain rate imaging was addressed in a study by Davidavicius et al (111). They found that 95% of segments were analysable during dobutamine stress. Due to noise problems, strain rate imaging was not feasible during treadmill or bicycle stress. The study, however, was small and limited to healthy individuals. The normal response during dobutamine stress was an increase in velocity, strain rate and strain at low dose, and a further increase in velocity and strain rate at high dose, while strain showed a plateau. This is intuitive, and concordant with an initial increase in contractility and stroke volume at low dose dobutamine, but with increasing heart rate without increased venous return at higher dobutamine levels, resulting in a plateau or even diminished stroke volume. Velocity and strain rate of contraction, however, continue to increase as ejection time shortens. This is in contrast to exercise, where increased venous return increases stroke volume even at high heart rates.

Kowalski et al (112) extended the testing of SRI to patients with coronary artery disease: 20 patients with chest pain were examined, 16 with positive coronary angiography. Feasibility was over 95% of segments, and narrow angle and wide angle sectors gave similar results. Peak systolic strain rate showed a linear increase from baseline to peak stress. Ischemic segments (critical stenoses) showed no increase in strain or strain rate during low or high dose dobutamine. However, some ischemic segments showed normal velocity responses to dobutamine, which the authors suggest is due to tethering. A different explanation could be that isovolumic contraction velocities are mistaken for peak ejection velocities, as shown in this clinical example. No overall analysis of diagnostic criteria for ischemia was done beyond clinical examples. The study confirms that SRI may have a clinical potential, but was not designed to determine the ability of SRI to diagnose coronary artery disease.

The clinical value of SRI was addressed in a study by Voigt et al (113). The study included 44 patients, with single photon emission computed tomography (SPECT) as the reference method for ischemia, but with coronary angiography as well. The study reports 100% sensitivity and specificity of SPECT compared with coronary angiography, somewhat higher than usual, indicating that the material is somewhat selected. SPECT is then used as the gold standard for ischemia. In general, the sensitivity of SPECT against coronary angiography is around 90%. It can easily be argued that angiography does not show ischemia, and that SPECT is thus a better reference, but that assumes a perfect sensitivity of SPECT. It can equally well be argued that both SPECT and stress echo have limited sensitivity; in that case, not all studies will show ischemia by both methods, and coronary angiography will then serve as an external reference that is the same for both methods.

In this study the feasibility was 92% for tissue velocities, and 85% for SRI, which is reasonable in our experience with SRI artefacts.

In non ischemic segments, peak systolic strain rate increased significantly with dobutamine stress, from -1.6 ± 0.6 s⁻¹ to -3.4 ± 1.4 s⁻¹, while strain during ejection changed only minimally, from -17 ± 6% to -16 ± 9%. During dobutamine, 47 myocardial segments in 19 patients developed scintigraphy-proven ischemia. In these, strain rate increased only from -1.6 ± 0.8 s⁻¹ to -2.1 ± 1.1 s⁻¹, and strain decreased from -16 ± 7% to -10 ± 8%, both significantly different from non ischemic segments. Post systolic shortening (PSS) was found in all ischemic segments. By ROC analysis, the AUC was 0.57 for peak strain (not surprising, as this includes post systolic strain), 0.65 for end systolic strain, 0.74 for peak systolic strain rate, 0.80 for time to end of negative strain rate and 0.90 for the post systolic index (PSI). The PSI thus seemed to be the best parameter to identify stress-induced ischemia; however, as no confidence intervals are given, the significance of the differences cannot be ascertained. Furthermore, in qualitative analysis of parametric strain rate imaging, SRI curved M-mode improved sensitivity/specificity from 81/82% to 86/89% compared with conventional grey scale reading. The statistical significance of this difference, however, is not given in the paper.
 





Colour SRI M-modes from the septum of the same examination, clearly showing the development of a prolonged shortening period in the apex at 20 µg/kg/min, but still with systolic shortening as well. During peak stress, there is virtually no systolic shortening, only post systolic.
Strain curves at 20 µg/kg/min (top) and peak stress (bottom), showing systolic hypokinesia with PSS at low dose, and akinesia in the septum / dyskinesia laterally, with PSS, at peak.

In a further paper from the same study (114), giving much the same data, SRI is also compared to tissue velocities and displacement for diagnostic accuracy. Visual wall motion had a sensitivity/specificity of 81/82%, post systolic strain / peak strain 81/82%, and segmental tissue velocities 74/63%. These numbers, however, refer to sensitivity against SPECT. Applying a SPECT sensitivity of 90% against angiography (0.81 × 0.90 ≈ 0.73), one ends up with a traditional sensitivity of WMS against angiography of about 73%, which is definitely lower than in other studies using harmonic imaging (104, 105). Again an instance of new methods leading to a decrease in the apparent sensitivity of established methods.

This might be due to several reasons:
  1. The grey scale images are acquired with tissue Doppler data in the background, reducing image quality and frame rate slightly.
  2. Recordings may to a certain degree be less optimised for endocardial visibility, as proper alignment is more important for tissue Doppler.
  3. Patients are included with less regard to grey scale image quality, as one expects to rely on both grey scale and tissue Doppler information.
This will make stress echo available to more patients, but as the general grey scale sensitivity may then be expected to decline, one has to utilise both 2D and tissue Doppler information.

This is analogous to the effect seen with harmonic imaging, and a parallel effect is described when using contrast for left ventricular opacification in stress echo (123), although that study gives neither the number of substandard recordings without contrast, nor the impact on sensitivity.

The accuracy of velocities alone in detecting ischemia is comparable with the MYDISE study (110). This again illustrates the main principle: as segmental velocities are dependent on overall function and adjacent segments, they are not suited to segmental analysis (40). If one only analyses the velocities in the ischemic segments, the sensitivity will be low as well. The overall sensitivity of peak velocity in any segment for the presence of ischemia is not given. However, by the same principle, the overall sensitivity for detecting ischemia anywhere is good if all segments are analysed, as shown by the Brisbane group. Thus, peak velocities are a fair screening tool for ischemia; on the other hand, for locating the area of ischemia, the strain rate indices are probably better.


A larger, cooperative study of dobutamine stress from Trondheim and Brisbane (128), including 197 patients, where half were used for ROC analysis and the other half to test the sensitivity and specificity of the findings, showed a feasibility of analysis of 65 - 85% of segments at peak stress, depending on method, significantly lower than WMS by B-mode, which was 98%. Another study (132) showed a feasibility of 92% of segments at peak stress; however, this may reflect too low a rejection rate for reliable results, as discussed elsewhere. Still, analysis was feasible in all patients. The results showed higher sensitivity with SRI, both by the velocity gradient and by segmental strain from combined ST-TDI (87 and 84% for peak systolic strain rate, 87 and 88% for strain), versus WMS (75%); this difference was significant (p = 0.02 - 0.04). There was a trend towards better specificity and accuracy of strain rate and strain compared to WMS as well, but this was not significant. The post systolic index, however, was significantly poorer than strain and strain rate (p = 0.01 - 0.04), and with a sensitivity of 71%, probably no better than WMS.




The sensitivity of three strain rate imaging parameters during peak stress: peak systolic strain rate, end systolic strain and post systolic index (PSI), with values for both the segmental strain method and the velocity gradient method. Sensitivity by PSI was significantly lower, but only by the velocity gradient method. From (128)
Sensitivity of wall motion score (WMS) versus peak systolic strain rate and end systolic strain, by both the segmental strain and the velocity gradient method. The difference between either strain method and WMS was significant. From (128)
In a larger prospective study from Brisbane (133), including 646 patients with an average follow up time of 5.2 years, prediction of all cause mortality was analysed in terms of clinical variables (diabetes mellitus, age, previous MI), resting and stress wall motion abnormalities, and stress strain rate and strain. Peak wall motion score index, mean SR(s), segmental S(es) and segmental SR(s) were all predictors of mortality, but only segmental SR(s) (hazard ratio 3.6; 95% CI 1.7 to 7.2) was independently predictive. In sequential Cox models, the model based on clinical data (overall χ² 12.7) was improved by peak wall motion score index (18.4, p = 0.002) and further increased by either mean SR(s) (25.7, p = 0.009) or segmental SR(s) (31.8, p < 0.001).
It has been suggested that the mean value identified patients with regional ischemia who did not have the capacity to compensate with hyperkinesia in other segments. In that case, this may be the incremental information that is not derived from WMS. Peak systolic strain rate again had better predictive value than strain.
Incremental prognostic value of SRI variables in a series of Cox regression models predicting all-cause mortality. The clinical variables (diabetes mellitus, age, previous MI) were entered together (1), followed by separate models combining these with either resting WMSI (2) or stress WMSI (3). Then, clinical variables plus stress WMSI were entered together, and each SRI variable was added in separate analyses: 4: segmental end-systolic strain, 5: mean peak systolic strain rate, 6: segmental peak systolic strain rate. From (133)
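The sequential (nested) model comparison can be set up as sketched below, using the lifelines package; the file and column names are hypothetical, and this is only the type of analysis described, not the study's code:

```python
import pandas as pd
from lifelines import CoxPHFitter

df = pd.read_csv("followup.csv")   # assumed: one row per patient

def fit(covariates):
    # Cox proportional hazards model on the given covariates
    cph = CoxPHFitter()
    cph.fit(df[covariates + ["time", "death"]],
            duration_col="time", event_col="death")
    return cph

clinical = ["diabetes", "age", "previous_mi"]
m1 = fit(clinical)                                     # clinical variables only
m2 = fit(clinical + ["stress_wmsi"])                   # + stress WMSI
m3 = fit(clinical + ["stress_wmsi", "segmental_srs"])  # + segmental SR(s)

# likelihood ratio chi-square for each added variable (1 df per step)
print(2 * (m2.log_likelihood_ - m1.log_likelihood_))
print(2 * (m3.log_likelihood_ - m2.log_likelihood_))
```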

Thus there is reason to accept these results as evidence of independent value of strain rate imaging.

The clinical evidence concerns dobutamine stress only. Early experience (111) seems to indicate that SRI is less feasible during exercise stress due to the increase in motion artefacts, although this evidence is limited.

Still, strain rate imaging has a lot of pitfalls, and they tend to become even more pronounced with increasing stress. A critical eye should always be applied to data quality before analysis, and all segments with low data quality should be discarded. Parametric imaging is probably superior for visualising the extent of ischemia (and indeed for seeing whether the curves are credible at all, by having a certain extent), as well as for timing, especially tardykinesia. Only one stress echo study so far (113) has addressed the qualitative visual assessment of colour SRI.

Finally, as with all echo measurements, SRI should always be considered a part of the total echo examination.


Back to website index.

References:



Editor: Asbjørn Støylen Contact address: asbjorn.stoylen@ntnu.no, Updated: 2012