911 F.2d 941
United States Court of Appeals, Third Circuit.
942*942 John R. Connelly, Jr. (argued), Drazin and Warshaw, P.C., Red Bank, N.J., for appellants.
Susan Scott (argued), David W. Garland, Riker, Danzig, Scherer & Hyland, Morristown, N.J., for appellee Merrell Dow Pharmaceuticals.
Before BECKER, and STAPLETON, Circuit Judges and KELLY, District Judge[*].
OPINION OF THE COURT
STAPLETON, Circuit Judge:
This is an appeal in a diversity action brought under New Jersey law by the DeLuca family against Merrell Dow Pharmaceuticals Corporation, the manufacturer of 943*943 Bendectin. The DeLucas seek damages for severe birth defects suffered by Cindy DeLuca's daughter Amy. Amy was born with limb reduction defects of the lower extremities: the lower portion of her left leg is deformed with anterior bowing of the tibia, absence of the fibula and three toes, and considerable shortening; and her right foot is missing a toe. The DeLucas allege that these birth defects were caused by Cindy DeLuca's use of Bendectin during the time she was pregnant with Amy.
Merrell Dow filed a motion for summary judgment alleging that the only causation evidence produced by the DeLucas was inadmissible because all relevant epidemiological studies have determined there is no statistically significant link between the use of Bendectin during pregnancy and the type of birth defects suffered by Amy DeLuca and these studies were the only reasonable basis for expert opinions. In response, the DeLucas proffered affidavits and deposition testimony by Dr. Alan Done, an expert in pediatric pharmacology, in which Dr. Done opined that the available epidemiological data does support the conclusion that Bendectin causes limb reduction defects and that he believed, to a reasonable degree of medical certainty, Bendectin caused Amy's defects. The district court held that Dr. Done's testimony would be inadmissible at trial because it was not based on data of a type reasonably relied upon by experts in the pertinent fields in issuing opinions on these subjects, as is required by Federal Rule of Evidence 703. 131 F.R.D. 71. Since Dr. Done's testimony was the sole causation evidence the DeLucas tendered in response to Merrell Dow's motion, the district court entered summary judgment for Merrell Dow. On appeal, the DeLucas argue that the district court misapplied Federal Rule of Evidence 703 in excluding Dr. Done's testimony. We agree and we will reverse and remand for proceedings consistent with the principles articulated herein.
I. THE LEGAL AND SCIENTIFIC SETTING
This is one of the last of over 1,000 suits alleging that birth defects were caused by the drug Bendectin. Bendectin, a prescription drug prescribed for morning sickness in pregnant women, was first approved for sale by the Food and Drug Administration in 1956. Public expressions of concern about Bendectin's relationship to birth defects mounted in the 1970's. In response, Bendectin's safety was examined by the FDA, and in 1980, the FDA's Advisory Committee on Fertility and Maternal Health concluded that the relevant information "did not demonstrate an increased risk of birth defects with Bendectin use" but urged that studies be continued. App. at 195. The FDA continues to approve its sale for use during pregnancy.
Despite the committee report and the fact that no published study has concluded that Bendectin increases the risk of birth defects, thousands of tort cases were filed by plaintiffs alleging that Bendectin had caused their children's birth defects. While Merrell Dow prevailed in the most prominent of the trials arising out of these numerous cases, a multi-district common issues trial involving over 800 cases, it has also had large verdicts entered against it in other suits, though most of these have been reversed on appeal or overturned on a motion for judgment n.o.v. As a result of escalating insurance and litigation costs resulting from these cases, and decreased use of Bendectin flowing from the controversy surrounding its safety, Merrell Dow has ceased production of Bendectin.
In this case, the district court faced one of the difficult questions that has pervaded Bendectin litigation to this point: whether an expert may testify, in light of existing scientific knowledge, that Bendectin is a teratogen, i.e., an agent that causes birth defects. The district court held Dr. Done's testimony to be inadmissible, citing the requirement of Federal Rule of Evidence 703, that expert opinion be based on data reasonably 944*944 relied upon by experts in the relevant field. The district court reached this conclusion despite the fact that most of the data relied upon by Dr. Done was data from peer reviewed articles in medical journals that was relied upon by the authors of these articles, as well as by Merrell Dow's own expert.
In the record that served as the basis for the district court's decision, Merrell Dow did not identify particular data sets it believed Dr. Done could not reasonably rely upon. Nor did it address the specific methodology and reasoning underlying Dr. Done's conclusion that Bendectin is a teratogen. Instead, Merrell Dow relied upon the great weight of scientific opinion in its favor and upon prior cases in which testimony that Bendectin is a teratogen was held to be inadmissible or insufficient to support a verdict. This was consistent with its apparent litigation strategy which was to emphasize that "[i]n all material respects, the instant case is identical to the cases where summary judgment has been granted in Merrell Dow's favor." App. at 38.
Following Merrell Dow's lead, the district court did not point to specific deficiencies in the data utilized by Dr. Done and while it cited Rule 703, it made no record-supported, factual finding that Dr. Done had relied upon data experts in the field would have considered unreliable. Instead, the district court devoted most of its opinion to surveying the case law cited by Merrell Dow. In only two brief sentences of its opinion did the district court address Dr. Done's statistical analysis of the available epidemiological evidence. The first sentence states that the authors of the studies used by Dr. Done concluded that a "statistically significant" link between Bendectin and birth defects existed only for defects other than limb reduction defects or concluded that Bendectin does not cause birth defects. App. at 29. Dr. Done, as we shall see, readily admits that his interpretation of the data collected for these studies differs from the authors'. The second sentence appears to discard Dr. Done's analysis because he is not an epidemiologist, id., despite Merrell Dow's express agreement to assume, for purposes of its motion for summary judgment, that Dr. Done was qualified to read and interpret epidemiological studies. On this basis, the district court held that the DeLucas had "not approached a showing that Dr. Done's opinion has a foundation as required by Federal Rule of Evidence 703." Id.
A. Standard of Review
Our review of a district court's decision to exclude the testimony of an expert is ordinarily limited to ensuring there has been no abuse of discretion, but to the extent the district court's ruling turns on an interpretation of a Federal Rule of Evidence our review is plenary. In re Japanese Electronic Products Litigation, 723 F.2d 238, 276 (3d Cir.1983), rev'd on other grounds sub. nom., Matsushita Electric Industrial Co. Ltd. v. Zenith Radio Corp., 475 U.S. 574, 106 S.Ct. 1348, 89 L.Ed.2d 538 (1986); United States v. Furst, 886 F.2d 558, 571 (3d Cir.1989). The standard of review of a district court's entry of summary 945*945 judgment is plenary, and we apply the same standard as the district court. Erie Telecommunications, Inc. v. City of Erie, 853 F.2d 1084, 1093 (3d Cir.1988). Summary judgment is appropriate when, after considering the record evidence in the light most favorable to the nonmoving party, no genuine issue of material fact exists and the moving party is entitled to judgment as a matter of law. Fed.R.Civ.P. 56(c).
B. The Relevant Scientific Principles and Tendered Evidence
To competently analyze the legal issues presented by this appeal, an understanding of the relevant scientific principles, albeit necessarily a rudimentary one drawn primarily from the relevant sources cited to by the parties, is essential. Problematic issues of causation arise in Bendectin cases because the etiology of most birth defects is unknown. There is no apparent way to determine from clinical examinations of Amy DeLuca whether her limb defects were the result of her mother's exposure to Bendectin, as opposed to another possible teratogen, or whether her birth defects are simply an inexplicable natural occurrence not induced by her mother's exposure to an outside agent. Rather, the only particularistic evidence the DeLucas can show to strengthen the inference that Amy DeLuca's birth defects were caused by Bendectin is to rule in Bendectin as a possible cause by showing that Amy was exposed to it during the time her limbs were developing, i.e., during organogenesis, and to rule out other possible causes by showing that Amy was not exposed to them during the critical period of organogenesis. Merrell Dow did not contend before the district court that the DeLucas failed to present sufficient evidence in this regard.
Thus, the DeLucas must rely primarily on inferences drawn from epidemiological data to show causation in Amy's case. Epidemiology, a branch of science and medicine, uses studies to "observe the effect of exposure to a single factor upon the incidence of disease in two otherwise identical populations." Black & Lilienfeld, Epidemiological Proof In Toxic Tort Litigation, 52 Fordham L.Rev. 732, 755 (1984). In the Bendectin context, an epidemiological study ideally attempts to determine the incidence of birth defects among the children of two groups of women, identical in all respects except for their use of Bendectin during pregnancy. Epidemiological studies do not provide direct evidence that a particular plaintiff was injured by exposure to a substance. Such studies have the potential, however, of generating circumstantial evidence of cause and effect through a process known as hypothesis testing, a process which "amounts to an attempt to falsify the null hypothesis and by exclusion accept the alternative." K.J. Rothman, Modern Epidemiology 116 (1986) ("Rothman"). The null hypothesis is the hypothesis that there is no association between two studied variables, id.; in this case the key null hypothesis would be that there is no association between Bendectin exposure and an increase in limb reduction defects. The important alternative hypothesis in this case is that Bendectin use is associated with an increased incidence of limb reduction defects.
The great weight of scientific opinion, as is evidenced by the FDA committee results, 946*946 sides with the view that Bendectin use does not increase the risk of having a child with birth defects. Sailing against the prevailing scientific breeze is the DeLucas' expert Dr. Alan Done, formerly a Professor of Pharmacology and Pediatrics at Wayne State University School of Medicine, who continues to hold fast to his position that Bendectin is a teratogen. In spite of his impressive curriculum vitae, Dr. Done's opinion on this subject has been rejected as inadmissible by several courts.
Dr. Done's opinion that Bendectin is a teratogen largely rests on inferences he draws from epidemiological data, most of which he contends are the same that was utilized by the experts, including the FDA committee, to whom Merrell Dow cites to bolster its contention that Bendectin does not cause birth defects. The principal difference is that Dr. Done analyzes that data using an approach, advocated by Professor Kenneth Rothman of the University of Massachusetts Medical School, that places diminished weight on so-called "significance testing." See K.J. Rothman, Modern Epidemiology (1986) ("Rothman"); see also, Rothman, A Show of Confidence, New Eng. J. of Medicine, Dec. 14, 1978, 1362.
Epidemiological studies, of necessity, look to the experience of sample groups as indicative of the experience of a far larger population. Epidemiologists recognize, however, that the experience of the sample groups may vary from that of the larger population by chance. Thus, a showing of increased risk for birth defects among women using Bendectin in a particular study does not automatically prove that Bendectin use creates a higher risk of having a child with birth defects because the discrepancy between the exposed and unexposed groups could be the product of chance resulting from the use of only a small sample of the relevant populations. 947*947 As a result of the acknowledged risk of this so-called "sampling error," researchers typically have rejected the associations suggested by epidemiological data unless those associations survive the rigors of "significance testing." This practice has also found favor in the legal context. A number of judicial opinions, discussed infra, have found Bendectin plaintiffs' causation evidence inadmissible because every published epidemiological study of the relationship of Bendectin exposure to the incidence of birth defects has concluded that there is not a "statistically significant" relationship between these two events.
Significance testing has a "P value" focus; the P value "indicates the probability, assuming the null hypothesis is true, that the observed data will depart from the absence of association to the extent that they actually do, or to a greater extent, by actual chance." Rothman, supra, at 116. If P is less than .05 (or 5%) a study's finding of a relationship supportive of the alternative hypothesis is considered statistically significant; if P is greater than 5% the relationship is rejected as insignificant. Accordingly, the results of a particular study are reported as simply "significant" or "not significant" or as P<.05 or P>.05.
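By way of illustration only, the P value described above can be sketched for a single study laid out as a two-by-two table. The sketch assumes a one-sided Fisher exact test, which is one common way of computing such a probability and is not drawn from the record; the function names and any counts supplied to them are hypothetical.

```python
from math import comb


def fisher_one_sided_p(a, b, c, d):
    """One-sided P value for a 2x2 table, assuming the null hypothesis.

    a = exposed children with the defect,   b = exposed without it
    c = unexposed children with the defect, d = unexposed without it

    Returns the probability, with all row and column totals held fixed,
    of seeing `a` or more affected children in the exposed group.
    """
    exposed = a + b
    cases = a + c
    total = a + b + c + d
    upper = min(exposed, cases)
    # Sum the hypergeometric probabilities of every table at least as extreme.
    favorable = sum(
        comb(exposed, k) * comb(total - exposed, cases - k)
        for k in range(a, upper + 1)
    )
    return favorable / comb(total, cases)


def is_significant(p_value, alpha=0.05):
    """Apply the conventional cutoff: significant only if P falls below alpha."""
    return p_value < alpha


if __name__ == "__main__":
    # Hypothetical counts, not taken from any Bendectin study.
    p = fisher_one_sided_p(8, 992, 4, 996)
    print(f"P = {p:.3f}, significant at .05: {is_significant(p)}")
```

The second function makes the convention visible: the same data are labeled "significant" or "not significant" solely by comparison with the chosen cutoff.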
Use of a .05 P value to determine whether to accept or reject the null hypothesis necessarily enhances one of two types of possible error. Type one error is when the null hypothesis is rejected when it is in fact true. Type two error is when the null hypothesis is in fact false but is not rejected. Rothman notes that at .05, the null hypothesis will "be rejected about 5 per cent of the time when it is true," a relatively small risk of type one error. Id. at 117. Unfortunately, the relationship between type one error and type two error is not simple; however, one study in the context of an employment discrimination case concluded that when the risk of type one error equalled 5%, the risk of type two error was 50%. Cohen, Confidence in Probability: Burdens of Persuasion in a World of Imperfect Knowledge, 60 N.Y.U.L.Rev. 329, 411 & n. 116 (1985) (citing Dawson, Investigation of Fact—The Role of the Statistician, 11 Forum 896, 907-08 (1976)). Type one error may be viewed here as the risk of concluding that Bendectin is a teratogen when it is not. Type two error is the risk of concluding that Bendectin is not a teratogen, when it in fact is.
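The trade-off between the two types of error can be sketched under a deliberately simplified model. The sketch assumes a one-sided test on a normally distributed estimate, with the true effect expressed in standard-error units; this is an assumption made for illustration and is not the calculation performed in the study cited by Cohen. The function name and parameters are hypothetical.

```python
from statistics import NormalDist


def type_two_error(effect_in_se, alpha=0.05):
    """Chance of failing to reject a false null hypothesis.

    effect_in_se: the true effect divided by the standard error of the
                  estimate (an assumed quantity; it is never known in practice).
    alpha:        the accepted risk of type one error.
    """
    standard_normal = NormalDist()
    # The estimate must exceed this many standard errors to be "significant".
    critical = standard_normal.inv_cdf(1 - alpha)
    # Probability that the estimate falls short of the cutoff anyway.
    return standard_normal.cdf(critical - effect_in_se)


if __name__ == "__main__":
    for effect in (0.5, 1.0, 2.0, 3.0):
        print(f"true effect {effect} SE -> type two risk "
              f"{type_two_error(effect):.2f}")
```

In this model, a true effect that sits exactly at the significance cutoff is missed half the time, and tightening alpha from .05 to .01 raises the risk of type two error for any given effect.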
Rothman contends that there is nothing magical or inherently important about .05 significance; rather this is just a common value on the tables scholars use to calculate significance. Rothman, supra, at 117; see also Cohen, supra, at 412 (noting that the .05 level of significance used in the social and physical sciences is a conservative and arbitrary value choice not necessarily valuable in the legal setting); Kaye, Is Proof of Statistical Significance Relevant?, 61 Wash.L.Rev. 1333, 1343-44 (1986). He stresses that the data in a certain study may indicate a strong relationship between two variables but still not be "statistically significant" and that the level of significance which should be required depends on the type of decision being made and the relative values placed on avoiding the two types of risk.
To convey both the extent to which two variables are associated in the data, and the extent to which this association might be the product of chance, Rothman advocates reporting both a "relative risk" (or point estimate) and "confidence intervals." In the context of an epidemiological study of Bendectin's relationship to birth defects, the relative risk is the ratio of the incidence rate of birth defects in the study group exposed to Bendectin divided by the rate in the control group not exposed to Bendectin. Black & Lilienfeld, supra, at 758. If a study found no difference in the rate of birth defects between the Bendectin exposed group and the control group, it yields a relative risk identical to the null hypothesis that Bendectin exposure is not associated with an increased incidence of birth defects. The relative risk would thus be reported as "1", signifying no difference between the rate of birth defects in each group.
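The relative risk described above reduces to a single ratio. The following is a minimal sketch; the function name and the counts are hypothetical and are not taken from any study in the record.

```python
def relative_risk(exposed_cases, exposed_total, control_cases, control_total):
    """Ratio of the defect rate in the exposed group to the rate in the control group."""
    exposed_rate = exposed_cases / exposed_total
    control_rate = control_cases / control_total
    return exposed_rate / control_rate


if __name__ == "__main__":
    # Equal rates in both groups correspond to the null hypothesis.
    print(relative_risk(10, 1000, 10, 1000))
    # A doubled rate in the exposed group.
    print(relative_risk(20, 1000, 10, 1000))
```

A value above "1" indicates a higher rate of defects in the exposed group within the sample; it does not by itself indicate how much of that difference might be the product of chance.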
948*948 A confidence interval is a way of graphically representing the probability that the relative risk figure or any other relationship between two studied variables is the actual relationship. The interval is a range of sets of possible values for the true parameter that is consistent with the observed data within specified limits. Rothman, supra, at 119; D. Barnes & J. Conley, Statistical Evidence in Litigation, § 3.15 at 107 (1986) (defining a confidence interval as a limit above or below or a range around the sample mean, beyond which the true population is unlikely to fall). A 95% confidence interval is constructed with enough width so that one can be confident that it is only 5% likely that the relative risk attained would have occurred if the true parameter, i.e., the actual unknown relationship between the two studied variables, were outside the confidence interval. If a 95% confidence interval thus contains "1", or the null hypothesis, then a researcher cannot say that the results are "statistically significant," that is, that the null hypothesis has been disproved at a .05 level of significance. Kaye, Is Proof of Statistical Significance Relevant?, supra, at 1348.
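A confidence interval around a relative risk can be sketched as follows. The sketch assumes the common large-sample approximation that works on the logarithm of the relative risk; nothing in the record indicates that Rothman or Dr. Done used this particular formula, and the function names and counts are hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def rr_confidence_interval(a, n1, c, n0, level=0.95):
    """Approximate confidence interval for a relative risk.

    a, n1: affected children and total children in the exposed group
    c, n0: affected children and total children in the control group
    level: the confidence level, e.g. 0.90, 0.95 or 0.99
    """
    log_rr = log((a / n1) / (c / n0))
    # Large-sample standard error of the log relative risk.
    se = sqrt(1 / a - 1 / n1 + 1 / c - 1 / n0)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return exp(log_rr - z * se), exp(log_rr + z * se)


def contains_null(interval):
    """True when the interval includes a relative risk of 1."""
    low, high = interval
    return low <= 1.0 <= high


if __name__ == "__main__":
    # Hypothetical study: 6 of 1000 exposed and 4 of 1000 unexposed children affected.
    for level in (0.90, 0.95, 0.99):
        low, high = rr_confidence_interval(6, 1000, 4, 1000, level)
        print(f"{level:.0%} interval: {low:.2f} to {high:.2f}")
```

Raising the confidence level widens the interval, so whether a given interval contains "1" depends in part on the level chosen.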
The result of a study should be reported, in Rothman's view, by reference to the confidence intervals at various confidence levels, e.g., 90%, 95%, 99%. The inclusion of confidence intervals of a variety of levels reflects Rothman's view that the predominating choice of a 95% confidence level is but an arbitrarily selected convention of his discipline. More importantly, however, Rothman insists that the precise locations of the boundaries of the confidence intervals, the all important focus of "significance testing," are far less important than their size and location. According to Rothman, statistical theory suggests that it is "much more likely that the [true] parameter [i.e. the true relationship between the studied variables] is located centrally within an interval than it is that the parameter is located near the limits of the interval." Rothman, supra, at 124. As such, the primary focus should not be on the ends of an interval but rather on the "approximate position of the interval as a whole on its scale of measurement...." Id.
Finally, Rothman contends that the use of significance testing is especially unhelpful when a decisionmaker is attempting to draw inferences from more than one study. Different studies may each be rejected as insignificant, yet, when the studies are looked at collectively, a majority of the data may be moderately or strongly contradictory to the null hypothesis. By failing to look at the collective data in the context of confidence intervals and the most likely estimate for the true parameter suggested by that data, researchers focusing solely on significance testing tolerate a high risk of type two error. Id. at 117-18.
Rothman suggests a less rigid approach in which researchers look at the confidence intervals produced by various studies. By charting the range of possibilities consistent with the data found in different studies it is possible to evaluate whether the collective data is more supportive of the proposition that the null hypothesis is false than that it is true. Id. at 124. At the same time, the use of confidence intervals indicates the risks inherent in generating any estimate of the true parameter from the data, and allows the decisionmaker to adjust the confidence level depending on the context in which a decision is required. Id. at 123-25; see also Kaye, Is Proof of Statistical Significance Relevant?, supra, at 1364.
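The point that several individually "insignificant" studies may point in the same direction can be sketched with invented numbers. The sketch assumes a fixed-effect, inverse-variance pooling of log relative risks, a standard technique swapped in here for illustration; it is not the graphical method attributed to Rothman, nor Dr. Done's analysis, and every count in `STUDIES` is hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist

# Hypothetical studies: (exposed cases, exposed total, control cases, control total).
STUDIES = [
    (12, 1000, 6, 1000),
    (9, 800, 5, 800),
    (15, 1200, 8, 1200),
    (7, 600, 4, 600),
    (11, 900, 6, 900),
]


def log_rr_and_se(a, n1, c, n0):
    """Log relative risk and its large-sample standard error."""
    return log((a / n1) / (c / n0)), sqrt(1 / a - 1 / n1 + 1 / c - 1 / n0)


def interval(log_estimate, se, level=0.95):
    """Confidence interval on the relative-risk scale."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return exp(log_estimate - z * se), exp(log_estimate + z * se)


def pooled(studies, level=0.95):
    """Inverse-variance weighted relative risk and its interval."""
    pairs = [log_rr_and_se(*study) for study in studies]
    weights = [1 / se ** 2 for _, se in pairs]
    pooled_log = sum(w * est for w, (est, _) in zip(weights, pairs)) / sum(weights)
    pooled_se = sqrt(1 / sum(weights))
    return exp(pooled_log), interval(pooled_log, pooled_se, level)


if __name__ == "__main__":
    for study in STUDIES:
        low, high = interval(*log_rr_and_se(*study))
        print(f"study {study}: {low:.2f} to {high:.2f}")
    estimate, (low, high) = pooled(STUDIES)
    print(f"pooled: {estimate:.2f} ({low:.2f} to {high:.2f})")
```

With these invented counts, each study's 95% interval contains "1", while the pooled interval lies entirely above it. Pooling of this kind presumes that the studies measure the same effect and are free of bias, assumptions that would themselves be open to dispute.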
Dr. Done attached the article and chapter by Rothman to his affidavit on behalf of the DeLucas and expressly indicated that his analysis was predicated on the methodology advocated by Rothman. Dr. Done purports to have analyzed all of the epidemiological data from the published epidemiological studies of the relationship between birth defects and Bendectin, as well as the data from several unpublished studies, utilizing the author's confidence interval if calculated, a 95% confidence interval if the author indicated a preference for that figure, or 90% otherwise.
Dr. Done has graphed the relative risks and confidence intervals for each of the separate sets of data together, so that the collective trend may be visualized. He concludes from analysis of these intervals that 949*949 the "bulk of the available human epidemiological data ... are indicative of Bendectin's human teratogenicity." App. at 345. Dr. Done contends that the effect in the data is strongest for, among other birth defects, limb reduction defects like Amy DeLuca's. Dr. Done did not, however, quantify the increased risk for limb reduction defects he believed was posed by use of Bendectin during pregnancy. Dr. Done's analysis has not been published nor has it been subjected to peer review by experts in the field.
C. The Bendectin Case Law
We recognize that the district court's decision to exclude Dr. Done's proposed testimony was heavily influenced by the decisions of other courts that have grappled with the difficult question of whether expert testimony that Bendectin causes birth defects is admissible and/or sufficient to sustain a verdict. A review of these cases is thus helpful, and should be preceded by a discussion of the most important of the prior Bendectin trials, a trial in which the admissibility and sufficiency of the plaintiffs' causation evidence was not a source of major dispute.
As federal dockets swelled in the early 1980's with Bendectin cases, the Judicial Panel on Multi-District Litigation transferred over 600 of these cases to the Southern District of Ohio for pre-trial discovery, where they were consolidated with 557 cases filed within that district. See In re Richardson-Merrell, Inc. "Bendectin" Products Liability Litigation, 624 F.Supp. 1212 (S.D.Oh.1985), aff'd in relevant part, 857 F.2d 290 (6th Cir.1988), cert. denied, 488 U.S. 1006, 109 S.Ct. 788, 102 L.Ed.2d 779 (1989). Over 800 of these cases were adjudicated in a common-issues trial before Chief Judge Rubin, who separated the question of whether Bendectin was a teratogen from the question of whether any particular plaintiff's birth defects were caused by Bendectin. After 21 days of testimony, during which the plaintiffs presented 10 expert witnesses, including Dr. Done, and 8 experts testified for the defense, id. at 1218, the jury was asked to answer this question:
Have the plaintiffs established by a preponderance of the evidence that ingestion of Bendectin at therapeutic doses during the period of fetal organogenesis is a proximate cause, [i.e. does it in a natural and continuous sequence produce injuries that would not have otherwise occurred], of human birth defects?
Id. at 1268. The jury unanimously answered no. Judge Rubin denied a post-trial motion for j.n.o.v. by the plaintiffs because "[b]oth sides presented testimony of eminently qualified and highly credible experts who differed in regard to the safety of Bendectin." Id. at 1244.
Despite this verdict, Bendectin cases continued to be litigated and contrary results achieved. In Oxendine v. Merrell Dow Pharmaceuticals, Inc., 506 A.2d 1100 (D.C.1986), for example, the District of Columbia Court of Appeals reinstated a verdict of $750,000 won by a plaintiff who alleged that her limb reduction defects were caused by her mother's use of Bendectin. The trial court had granted a j.n.o.v. for the defendants on the ground that the plaintiffs had not produced sufficient causation evidence.
The plaintiffs' sole causation witness in Oxendine was Dr. Done, who based his testimony on four types of information: (1) structure activity considerations, (2) in vitro animal studies, (3) in vivo animal studies, and (4) his interpretation of the available human epidemiological data on Bendectin's relationship to birth defects. The court of appeals was impressed by the careful and thorough nature of Dr. Done's testimony, in particular by his admissions that the first three types of evidence, while probative, could not definitively establish that Bendectin is a teratogen, and that no single epidemiological study demonstrated Bendectin's teratogenicity. Id. at 1108. Further, the court indicated that several of the defendant's experts conceded that Dr. Done's epidemiological methodology was sound, a concession that was bolstered by a statistical expert who testified that the more appropriate focus in analyzing epidemiological studies was on the relative risk 950*950 shown in the data, rather than on statistical significance. Id. at 1109.
Given these factors, the reviewing court found that Dr. Done's testimony provided a sufficient basis upon which to rest plaintiff's verdict, and that the trial court had erred by fragmenting Dr. Done's testimony:
Like the pieces of a mosaic, the individual studies showed little or nothing when viewed separately from one another, but they combined to produce a whole that was greater than the sum of its parts: a foundation for Dr. Done's opinion that Bendectin caused appellant's birth defects. The evidence also established that Dr. Done's methodology was generally accepted in the field of teratology, and his qualifications as an expert have not been challenged.
Id. at 1110.
A sharply different view of the sufficiency of Dr. Done's testimony regarding the teratogenicity of Bendectin has been taken by the Courts of Appeals for the First and District of Columbia Circuits. In Lynch v. Merrell-National Laboratories, 830 F.2d 1190 (1st Cir.1987), and Richardson by Richardson v. Richardson-Merrell Inc., 857 F.2d 823 (D.C.Cir.1988), cert. denied, ___ U.S. ___, 110 S.Ct. 218, 107 L.Ed.2d 171 (1989), those courts reviewed appeals from grants of j.n.o.v. in Bendectin limb reduction defect cases. Both held that Dr. Done's opinion that Bendectin is a teratogen is not only insufficient to support a verdict in light of the currently available scientific and medical evidence, but that it is inadmissible under Federal Rule of Evidence 703. Each court held that an opinion that Bendectin is a teratogen would be admissible under Federal Rule of Evidence 703 only if it were based on a new epidemiological study concluding that Bendectin was associated in a statistically significant way with an increase in birth defects.
In so holding, each court placed heavy emphasis on the large number of human epidemiological studies of Bendectin's relationship to birth defects in the scientific literature, and the fact that none of this peer-reviewed literature had concluded that there was a statistically significant association between Bendectin and birth defects.
We face ... a situation in which limb reductions are a fairly unusual subspecies of defect, in which the origin of most limb reduction is unknown, in which world-wide scientific investigations of Bendectin have produced no evidence establishing that Bendectin causes limb reduction, and in which the irrelevance of Bendectin to the incidence of limb defects has been demonstrated. The ignorance that prevails as to the etiology of most birth defects does not mean causation in a given case could not be proven; it does mean that there is a large terra incognita where gossip and guesswork abound, so that courts must carefully control the basis for testimony pointing to a particular cause. A new study coming to a different conclusion and challenging the consensus would be admissible evidence. Without such a study there is nothing on which expert opinion on Bendectin as a cause may be based.
Lynch, 830 F.2d at 1194 (emphasis added); accord Richardson, 857 F.2d at 832 ("the wealth of published epidemiological data ... none of which has concluded that the drug is teratogenic ... must be given their just due"); Ealy v. Richardson-Merrell, Inc., 897 F.2d 1159, 1163-64 (D.C.Cir.1990).
A somewhat different route arriving at the same destination was taken by the Fifth Circuit in another Bendectin limb reduction case. Brock v. Merrell Dow Pharmaceuticals, Inc., 874 F.2d 307, modified, 951*951 884 F.2d 166 (5th Cir.1989), cert. denied, ___ U.S. ___, 110 S.Ct. 1511, 108 L.Ed.2d 646 (1990). The Brock court indicated that the conflicting evidence on the subject of Bendectin's teratogenicity, the difficulties of determining whether any given substance is a teratogen, and science's inability to trace a known birth defect back to its cause leads to the potential for inconsistent verdicts in virtually identical cases. This uncertainty creates the danger that useful medicines will be withdrawn from the market and that new medicines will not be made available, not because they are harmful, but because of manufacturers' fear of tort liability. 874 F.2d at 309 & n. 9. The court opined that "[a]ppellate courts, if they take the lead in resolving those questions upon which juries will go both ways, can reduce some of th[is] uncertainty which can tend to produce a suboptimal amount of new drug development." 874 F.2d at 310. In reversing a jury verdict for plaintiffs, the court did not hold that the plaintiffs' expert testimony that Bendectin was a teratogen was inadmissible. Instead, the court held that their evidence was insufficient to sustain a verdict absent "statistically significant epidemiological proof that Bendectin causes limb reduction defects." 884 F.2d at 167.
The court purported to base its decision on a critical analysis of the reasoning of plaintiffs' experts but it did not explain the basis for its holding that statistically significant epidemiological results were required to sustain a verdict in plaintiffs' favor. The court emphasized that it "did not wish [its decision] to stand as a bar to future Bendectin cases in the event that new and statistically significant studies emerge that would give the jury a firmer basis upon which to determine the issue of causation." 884 F.2d at 167.
Despite the holdings of these courts of appeals, Judge Rubin, who presided over the multi-district common issues trial, recently denied a motion for summary judgment filed by Merrell Dow alleging that a group of consolidated Bendectin cases should be dismissed because the plaintiffs failed to produce a new published study supporting their assertion that Bendectin caused their birth defects. Judge Rubin denied the motion because he found a division in the scientific community as to whether epidemiological evidence was the only type of evidence that could reliably link Bendectin use to an increased risk of birth defects, and refused to substitute his judgment for experts in the relevant fields or to decide, instead of the jury, which view was the more reasonable. Thus, he denied Merrell Dow's assertion that the plaintiff's expert evidence, which was based on epidemiological evidence as well as structure activity analysis, and in vitro and in vivo studies, was inadmissible or insufficient to create a genuine issue of material fact. In re Bendectin Products Liability Litigation, 732 F.Supp. 744 (E.D.Mich.1990); see also Brock, 884 F.2d at 168 (Reavley, J. dissenting, along with five other judges, from the denial of a motion for rehearing in banc and criticizing the majority opinion in Brock).
We understand and sympathize with the concerns expressed in Brock over the costs and inequities that flow from inconsistent outcomes in Bendectin cases, the potential 952*952 effect erroneous verdicts have on the availability of useful medicines, and the wastefulness of continued reconsideration of an identical scientific issue in the courts. We are also troubled, as were the courts in Lynch and Richardson, by the potential for abuse that exists whenever an expert is permitted to testify to an opinion that is based upon reasoning and data that have not been subjected to the review of professional colleagues. This concern is naturally heightened when an expert is testifying on behalf of a plaintiff as sympathetic as a child crippled by serious birth defects.
However, our concern over these issues is tempered by our recognition that we do not have the authority to create special rules to address the problems posed by continued Bendectin litigation. Principles of issue preclusion have not developed to the point where we may bind plaintiffs by the finding of previous proceedings in which they were not parties, even by a proceeding as thorough as the multidistrict common issues trial. Lynch, 830 F.2d at 1192-93; In re Bendectin Products Liability Litigation, 732 F.Supp. at 746-48 (plaintiffs could not be bound to the results of the multi-district litigation common issues trial where (1) they had no direct financial or proprietary interest in the outcome of the trial and (2) they had no effective control over the theories or proofs advanced in that trial). Moreover, we may not manipulate our interpretation of the Federal Rules of Evidence to exclude expert testimony that on the record before us may satisfy normal standards of admissibility. Nor are we at liberty, especially in a case to be decided under our diversity jurisdiction, to impose different burdens of proof on Bendectin plaintiffs than those that would apply in analogous products liability suits. At the same time, however, we must require that Bendectin plaintiffs carry the evidentiary burdens imposed upon other plaintiffs. That is, plaintiffs must produce admissible evidence from which a jury could, applying the requisite burden of proof, reasonably find that their injuries were caused by Bendectin.
On a typical summary judgment motion in a Bendectin case, a court's task is essentially two-fold: (1) to scrutinize the admissibility of the plaintiff's expert testimony under the Federal Rules of Evidence, and (2) to measure what is admissible against the appropriate state law standard governing causation to determine whether summary judgment is appropriate. We address these issues in turn.
II. THE ADMISSIBILITY ISSUES
A. Rule 703
Federal Rule of Evidence 703 provides:
The facts or data in the particular case upon which an expert bases an opinion or inference may be those perceived by or made known to the expert at or before the hearing. If of a type reasonably relied upon by experts in the particular field in forming opinions or inferences upon the subject, the facts or data need not be admissible in evidence.
Rule 703 has a narrow function; it seeks to delimit the acceptable bases for expert testimony. We have read Rule 703, in conjunction with Rule 104(a), as requiring the district court to "make a factual inquiry ... as to what data experts in the field find reliable." In re Japanese Electronic Products Antitrust Litigation, 723 F.2d 238, 276 (3d Cir.1983), rev'd on other grounds, 475 U.S. 574, 106 S.Ct. 1348, 89 L.Ed.2d 538 (1986); see also 3 J. Weinstein & M. Berger, Weinstein's Evidence ¶ 703, at 703-16 (1988).
In performing this task, the district court must remain mindful that "[t]he proper inquiry is not what the court deems reliable, but what experts in the relevant discipline deem it to be." Id. at 276; see also Indian Coffee Corp. v. Procter & Gamble, 752 F.2d 891, 895 (3d Cir.), cert. denied, 474 U.S. 863, 106 S.Ct. 180, 88 L.Ed.2d 150 (1985). Further, we have noted that if an expert avers that his testimony is based on data experts in the field rely upon, then Rule 703's requirements are generally satisfied. Id. at 277. This reflects our recognition that Rule 703 was designed to broaden and liberalize the permissible bases for 953*953 expert testimony. Id.; see also Fed.R.Evid. 703 advisory committee's note.
In the present case, the district court purported to apply the correct standard. However, its cursory ruling that Done's testimony was inadequate under Rule 703 does not comply with the standard set forth in Japanese Products Litigation, as it was not predicated upon a record-supported, factual finding that Done relied upon identified data not regarded as reliable by experts in the field. Instead, the analysis in the district court's opinion referred only to Dr. Done's qualifications and the case law we have previously discussed indicating that the testimony of Dr. Done, or similar testimony, is inadmissible under Rule 703.
The district court appeared to discard Dr. Done's reanalysis of the available epidemiological evidence in part because he is not an epidemiologist. This was improper given Merrell Dow's concession that Dr. Done was qualified to interpret epidemiological data. It was also erroneous because an objection to Dr. Done's qualifications should be analyzed under Rule of Evidence 702, not Rule 703. Given the liberal criteria that govern the expertness inquiry, e.g., Habecker v. Copperloy Corp., 893 F.2d 49, 51-53 (3d Cir.1990), it is doubtful whether an expert with Dr. Done's credentials could be precluded from testifying about his interpretation of epidemiological evidence simply because he does not have a degree in epidemiology.
Putting aside the substantial question of whether the records in the prior Bendectin cases were materially different from the record here, these prior judicial opinions cannot sustain the district court's ruling because they do not address the question of whether reasonable experts would rely upon the epidemiological data Dr. Done bases his opinion on; rather, they primarily turn on the failure of that data to show a "statistically significant" link between Bendectin and an increased incidence of birth defects, and on the weight of scientific opinion contrary to Dr. Done's view that Bendectin is a teratogen.
While these factors may not be irrelevant to another type of challenge to Dr. Done's testimony, as we discuss hereafter, we do not view the absence of statistically significant findings or the great weight of contrary opinion as being relevant to the Rule 703 question posed here. Rule 703 is satisfied once there is a showing that an expert's testimony is based on the type of data a reasonable expert in the field would use in rendering an opinion on the subject at issue; it does not address the reliability or general acceptance of an expert's methodology. When a statistician refers to a study as "not statistically significant," he is not making a statement about the reliability of the data used, rather he is making a statement about the propriety of drawing a particular inference from that data.
At oral argument, counsel for Merrell Dow conceded that Merrell Dow had not specifically challenged the data Dr. Done relied upon. Indeed, with respect to most of Dr. Done's data, Merrell Dow is hardly in a position to claim that it is not of a type reasonably relied upon by experts in the field since Merrell Dow's expert relied upon the same epidemiological data from the published literature in formulating her opinion. To the extent Merrell Dow wishes to challenge particular sets of data Dr. Done has used, it is free to do so on remand. However, it has not attempted to show that Dr. Done's reliance upon particular epidemiological data is unreasonable, and the DeLucas had no burden to address arguments not made. Cf. Dowling v. United States, ___ U.S. ___, 110 S.Ct. 668, 673 n. 3, 107 L.Ed.2d 708 (1990) ("That the burden is on the introducing party to establish relevancy, does not also require the introducing party to anticipate and rebut possible objections to the offered evidence.").
954*954 Implicit in the district court's decision, and in the decisions in Richardson and Lynch, is the principle that Rule 703 requires an expert to accept the conclusions reached by the authors of studies if the expert wishes to utilize the data underlying those studies as a basis for testimony. However, the Federal Rules of Evidence contain no requirement that an expert's testimony be based upon reasoning subjected to peer-review and published in the professional literature. Indeed, while Brock expressed a distrust of expert testimony based on reasoning not subjected to such scrutiny, it expressly declined to hold that this was, in itself, a sufficient reason to exclude it. 874 F.2d at 313; Cf. Richardson, 857 F.2d at 831 & n. 55; Lynch, 830 F.2d at 1195 (each expressing doubt over testimony not grounded in peer-reviewed literature but not excluding it on that ground). We thus conclude that the present record provides no basis for excluding Dr. Done's testimony under Rule 703.
B. Rule 702
While Merrell Dow has not challenged the reliability of specific data utilized by Dr. Done, it has challenged before us the way in which he has used his data on a number of grounds, each of which it is free to pursue on remand. As we have noted, Merrell Dow's principal emphasis in this regard has been its insistence that an expert opinion based on epidemiological data and analysis is not admissible unless the data "disprove" the null hypothesis that Bendectin is not a teratogen at a .05 level of "statistical significance". This argument presents an issue of first impression in this circuit. We conclude that it should be evaluated under Rule 702 and in accordance with the teachings of United States v. Downing, 753 F.2d 1224 (3d Cir.1985).
Rule 702 provides:
If scientific, technical or other specialized knowledge will assist the trier of fact to understand the evidence or to determine a fact in issue, a witness qualified as an expert by knowledge, skill, experience, training, or education, may testify thereto in the form of an opinion or otherwise.
Rule 702 authorizes the admission of expert testimony so long as it is rendered by a qualified expert and is helpful to the trier of fact. American Technology Resources v. United States, 893 F.2d 651, 655 (3d Cir.1990); Habecker, 893 F.2d at 51; Breidor v. Sears, Roebuck & Co., 722 F.2d 1134, 1138-39 (3d Cir.1983). While no Federal Rule of Evidence specifically addresses the methodological fundamentals for expert testimony, Rule 702's helpfulness requirement implicitly contains the proposition that expert testimony that is based on unreliable methodology is unhelpful and therefore excludable. Downing, 753 F.2d 1224.
The reliability of expert testimony founded on reasoning from epidemiological data is generally a fit subject for judicial notice; epidemiology is a well-established branch of science and medicine, and epidemiological evidence has been accepted in numerous cases. See In re Agent Orange Product Liability Litigation, 611 F.Supp. 1223, 1243 (acceptability of reasoning from epidemiological evidence susceptible of judicial notice) & 1255 (citing cases admitting epidemiological evidence), aff'd, 818 F.2d 187 (2d Cir.1987). Thus, to the extent that Dr. Done's testimony is based on traditional epidemiological methodology, Rule 702 does not require its exclusion since his qualifications were stipulated to and his testimony goes to a critical issue in the case, cause-in-fact.
To the extent that the reliability of Dr. Done's mode of analysis is not susceptible of judicial notice, i.e., deviates from that which has consistently been admitted into evidence, however, the district court on remand must conduct a hearing and analysis consistent with the counsel provided in Downing. In that case this court articulated a flexible test for addressing contentions that expert testimony based on arguably unreliable techniques was "unhelpful" and thus inadmissible under Rule 702:
Rule 702 requires that a district court ruling upon the admission of (novel) scientific evidence, i.e. evidence whose scientific fundaments are not suitable candidates for judicial notice, conduct a preliminary 955*955 inquiry focusing on (1) the soundness and reliability of the process or technique used in generating the evidence, (2) the possibility that admitting the evidence would overwhelm, confuse, or mislead the jury, and (3) the proffered connection between the scientific research or test result to be presented, and particular disputed factual issues in the case.
Id. at 1237. The "fit" between Dr. Done's tendered testimony and the crucial causation issues in this case is a good one and the third Downing factor thus cuts in favor of its admissibility. It is the other factors, reliability and jury reaction, that the district court will need to address if Merrell Dow litigates this issue on remand.
In Downing, we explicitly rejected reliance upon the "general acceptance" test of admissibility, most prominently articulated in Frye v. United States, 293 Fed. 1013 (D.C.Cir.1923). We did so, among other reasons, because the general acceptance test was too vague and malleable to yield consistent results, and because its nose-counting emphasis often led to the exclusion of helpful evidence in contradiction to the spirit of the Federal Rules of Evidence. 753 F.2d at 1236-37. Thus, under Downing, Dr. Done's opinion cannot be excluded simply because the weight of scientific opinion leans against him. At the same time, however, the degree to which contrary opinion dominates the relevant literature is not wholly irrelevant to the reliability inquiry mandated by Downing. Id. at 1238.
We stress at the outset that the confidence level or "significance" of a statistical analysis is but a part of a meaningful evaluation of its reliability. See generally, J. Monahan & L. Walker, Social Science in Law: Cases and Materials 33-75 (1990). The results of such a study may fail to correspond to reality for a number of reasons other than "sampling error." Faulty data collection resulting from design or execution flaws, for example, can create a much greater risk of error than the sampling error. Thus, a poorly conceived or conducted study that disproves the null hypothesis at a .01 level of significance may be far less reliable than a well conceived and conducted study that is significant at a .1 level. Kaye, Is Proof of Statistical Significance Relevant?, supra at 1362. As a result, any assessment of reliability under Rule 702 should be conducted with an eye to all the risks of error posed by the proffered evidence.
By directing such an overall evaluation, however, we do not mean to reject at this point Merrell Dow's contention that a showing of a .05 level of statistical significance should be a threshold requirement for any statistical analysis concluding that Bendectin is a teratogen regardless of the presence of other indicia of reliability. That contention will need to be addressed on remand. The root issue it poses is what risk of what type of error the judicial system is willing to tolerate. This is not an easy issue to resolve and one possible resolution is a conclusion that the system should not tolerate any expert opinion rooted in statistical analysis where the results of the underlying studies are not significant at a .05 level. We believe strongly, however, that this issue should not be resolved in a case where the record contains virtually no relevant help from the parties or from qualified experts. The literature 956*956 evidences that there are legal scholars and epidemiologists who have given considerable thought to this and related issues and we would hope that this expertise could be made available to the court, on remand, in some acceptable manner.
Whatever resources are made available to the district court on remand, they should be utilized with a sensitivity to the relevant policy judgments reflected in the Federal Rules of Evidence. Those rules embody a strong and undeniable preference for admitting any evidence having some potential for assisting the trier of fact and for dealing with the risk of error through the adversary process. Barefoot v. Estelle, 463 U.S. 880, 899 & 901 n. 7, 103 S.Ct. 3383, 3397 & 3398 n. 7, 77 L.Ed.2d 1090 (1983); Downing, 753 F.2d at 1241 (germane to the decision whether to admit novel scientific testimony is the presumption of helpfulness accorded expert testimony under F.R.Evid. 702); 3 Weinstein's Evidence, supra, ¶ 702-. Thus, Rules 401 and 402 relating to relevance, and Rule 403 relating to undue prejudice, and Rules 701-703 relating to expert testimony provide for the admission of evidence with any marginal utility absent a substantial countervailing concern.
In considering the question of reliability on remand, the district court is permitted to identify relevant scientific communities and make determinations about the degree of acceptance of Dr. Done's methodology within those communities. Id. at 1238. Conversely, it may consider the extent to which members of these communities decline to give any weight to inferences not supported by .05 statistical significance. The district court should keep in mind, however, that the ultimate touchstone is helpfulness to the trier of fact, and with regard to reliability helpfulness turns on whether the expert's "technique or principle [is] sufficiently reliable so that it will aid the jury in reaching accurate results." 3 Weinstein's Evidence, 957*957 supra, ¶ 702, at 702-35. The fact that a scientific community may require a particular level of assurance for its own purposes before it will regard a null hypothesis as disproven does not necessarily mean that expert opinion with somewhat less assurance is not sufficiently reliable to be helpful in the context of civil litigation.
Even if it is found that Done's testimony meets the test of reliability, Downing recognizes that special dangers are posed by scientific evidence. Thus, the district court will be required to consider the "possibility that admitting the evidence could overwhelm, confuse, or mislead the jury." Downing, 753 F.2d at 1237. This inquiry focuses on the extent to which probative scientific evidence is capable of being properly utilized by the jury: will the jury be able to give it appropriate weight or will the evidence, because of its scientific origins, take on an importance beyond its probative value? The degree to which Dr. Done's testimony is susceptible of being understood by, rather than overwhelming, the jury, and the usefulness of cross-examination, competing expert testimony, and judicial control in this regard, factor into this calculus.
After considering the reliability of Dr. Done's testimony and the dangers it poses, the district court will have to reach the ultimate determination of whether it is "helpful" and thus admissible. That determination will require an exercise of discretion informed by the teachings of Downing and the record developed on remand. Once made, it will be entitled to deference. United States v. Ferri, 778 F.2d 985, 989-91 (3d Cir.1985) (finding no abuse of discretion where a district court's decision to admit arguably novel scientific evidence resulted from a rational application of the relevant criteria set forth in Downing to the record before it).
Merrell Dow is free to trigger this inquiry on remand by contending that Dr. Done's testimony is based on unreliable epidemiological methodology. But on the present record, we cannot by reference to Rule 702 affirm the district court's exclusion of that testimony. Dr. Done's qualifications were stipulated for the purposes of Merrell Dow's motion, his testimony goes to the crucial issue of causation, and his analysis purports to be based on a theory of epidemiological reasoning that has support in the published literature. Given these facts, we are unwilling in the absence of countervailing evidence or persuasive argument to conclude that his testimony would be unhelpful under Rule 702.
C. Rule 403
Merrell Dow urges upon us as an alternative ground for affirmance the exclusion of Done's testimony under Federal Rule of Evidence 403. The district court declined to rely on this ground and we could not exclude Done's testimony under Rule 403 on the present record. Moreover, if Done's testimony survives the rigors of Rules 702 and 703 on remand, Rule 403 is an unlikely basis for exclusion. Downing, 753 F.2d at 1243 n. 27.
III. THE SUFFICIENCY OF THE EVIDENCE ISSUE
Since the district court held that the DeLucas' sole evidence of causation was inadmissible, it had no difficulty in concluding that they had not met their burden under the Celotex trilogy to produce evidence sufficient to raise a genuine issue of material fact as to whether Amy DeLuca's birth defects were caused by Bendectin. Celotex Corp. v. Catrett, 477 U.S. 317, 106 S.Ct. 958*958 2548, 91 L.Ed.2d 265 (1986); Anderson v. Liberty Lobby, 477 U.S. 242, 106 S.Ct. 2505, 91 L.Ed.2d 202 (1986); Matsushita Electric Industrial Co. v. Zenith Radio Corp., 475 U.S. 574, 106 S.Ct. 1348, 89 L.Ed.2d 538 (1986). If Dr. Done's testimony is ultimately held to be admissible, however, a different issue will be presented. While we express no opinion on that issue, we wish to make clear that nothing in this opinion is intended to suggest that this issue is or is not susceptible of resolution by summary judgment.
As we have earlier noted, a court presented with a motion for summary judgment must ultimately determine whether the admissible evidence tendered by the party having the burden of proof on an issue is sufficient to permit a rational factfinder to find for that party on that issue under the appropriate burden of proof. Anderson, 477 U.S. at 252, 106 S.Ct. at 2512 (in a run of the mill civil case, the judge must ask on a motion for summary judgment whether "reasonable jurors could find by a preponderance of the evidence that the plaintiff is entitled to a verdict"). In the present context, Dr. Done's testimony may be found sufficiently helpful to be admissible and sufficiently probative to support a jury finding that Bendectin can cause birth defects or even that Bendectin not infrequently causes such defects. However, assuming that New Jersey would apply the traditional "more probable than not" burden of proof standard to the causation issue in this case, this admissible testimony would not alone bar summary judgment for Merrell Dow unless it would support a jury finding that Bendectin more likely than not caused the birth defects in this particular case.
Hypothetically, Dr. Done may be able to testify, on the basis of adequate data and the application of reasonably reliable methodology, for example, that of women who took Bendectin and had children with birth defects, 25% of the cases of birth defects can be attributed to Bendectin exposure. This testimony would be admissible as it would be a basis from which a jury could rationally find that Bendectin could have caused Amy DeLuca's birth defects; however, it would not without more suffice to satisfy the DeLucas' burden on causation under a more likely than not standard since a fact finder could not say on the basis of this evidence alone that Amy DeLuca's birth defects were more likely than not caused by Bendectin.
If New Jersey law requires the DeLucas to show that it is more likely than not that Bendectin caused Amy DeLuca's birth defects, and they are forced to rely solely on Dr. Done's epidemiological analysis in order to avoid summary judgment, the relative risk of limb reduction defects arising from the epidemiological data Done relies upon will, at a minimum, have to exceed "2":
A relative risk of "2" means that the disease occurs among the population subject to the event under investigation twice as frequently as the disease occurs among the population not subject to the event under investigation. Phrased another way, a relative risk of "2" means 959*959 that, on the average, there is a fifty per cent likelihood that a particular case of the disease was caused by the event under investigation and a fifty per cent likelihood that the disease was caused by chance alone. A relative risk greater than "2" means that the disease more likely than not was caused by the event.
Manko v. United States, 636 F.Supp. 1419, 1434 (W.D.Mo.1986), aff'd in relevant part, 830 F.2d 831 (8th Cir.1987).
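The arithmetic behind the quoted passage can be sketched as follows. This is an illustration only and forms no part of the record: it assumes the standard epidemiological formula for the attributable fraction among the exposed, (RR - 1) / RR, and the function names are hypothetical.

```python
def attributable_fraction(relative_risk: float) -> float:
    """Share of cases among the exposed attributable to the exposure.

    Assumes the standard formula (RR - 1) / RR; this formula is an
    assumption of the sketch, not something stated in the opinion.
    """
    if relative_risk <= 0:
        raise ValueError("relative risk must be positive")
    # A relative risk at or below 1 shows no excess risk among the exposed.
    return max(0.0, (relative_risk - 1.0) / relative_risk)


def relative_risk_for_fraction(fraction: float) -> float:
    """Inverse of the formula above: RR = 1 / (1 - fraction)."""
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    return 1.0 / (1.0 - fraction)


if __name__ == "__main__":
    # A relative risk of 2 corresponds to the fifty per cent figure in Manko.
    print(attributable_fraction(2.0))
    # The hypothetical 25% attributable share implies a relative risk below 2.
    print(relative_risk_for_fraction(0.25))
```

On these assumptions, the hypothetical 25% attributable share discussed above corresponds to a relative risk of roughly 1.33, which is why such testimony would not, without more, satisfy a more likely than not standard.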
We express no opinion on whether Dr. Done's epidemiological analysis fails to meet this threshold requirement. While it is not clear to our untrained eyes that it does, without the benefit of an expert affidavit critiquing that analysis we are not sufficiently confident of our own critical capacities to resolve that issue. Nor do we suggest that the DeLucas will be required to rely solely on Dr. Done's epidemiological analysis at trial or in any subsequent summary judgment proceedings. The alternative support that he finds for his conclusion in structural activity analysis, for example, may be entitled to some weight in determining whether they have met their burden of establishing a prima facie case. We note only that even if Dr. Done's epidemiological analysis is found to be admissible, the DeLucas are entitled to get to trial only if the district court is satisfied that this analysis together with any other evidence relevant to the causation issue would permit a jury finding that Amy's birth defects were, when measured against the appropriate burden of proof, caused by her mother's exposure to Bendectin.
We hold that the present record cannot sustain the exclusion of Dr. Done's testimony. Therefore, we will reverse the grant of summary judgment in Merrell Dow's favor and remand for further proceedings consistent with this opinion.
[*] Honorable Robert F. Kelly, United States District Judge for the Eastern District of Pennsylvania, sitting by designation.
 Summary judgment was entered on behalf of the captioned defendants, other than Merrell Dow, prior to the removal of this case to the district court. The DeLucas do not seek review of the order dismissing their action against these defendants.
 At oral argument before this court, Merrell Dow took the position that certain data relied upon by Done is inadequate, and that his method of interpretation is novel and unreliable. Neither of these contentions was explicitly raised below; most importantly, there is no expert evidence supporting either assertion in the record.
 To this end, Merrell Dow presented the following evidence in support of its motion for summary judgment: (1) epidemiological studies concluding there is no relationship between Bendectin and an increase in birth defects; (2) copies of published and unpublished opinions in which various courts excluded and/or held insufficient opinion testimony concluding that Bendectin is a teratogen; (3) the affidavit of Dr. Pauline Brenholz, a board certified geneticist and cytogeneticist but not an epidemiologist, who opined that, based on the existing knowledge, "it is absolutely impossible to conclude that it is `more probable than not' that Bendectin causes birth defects, or that a causal relationship has been shown between Bendectin and birth defects to a `reasonable degree of medical certainty.' Indeed, the impressive and overwhelming weight of scientific authority is to the contrary." App. at 199; and (4) copies of various FDA documents opining that there is no demonstrated link between Bendectin and an increased risk of birth defects.
 At oral argument, counsel for Merrell Dow attempted to rely upon Dr. Done's failure to rule out other possible causes for Amy's birth defects as a ground for affirmance. However, this was not a contention advanced below or in the briefs to this court.
 See also Hall & Silbergeld, Reappraising Epidemiology: A Response to Mr. Dore, 7 Harv.Envtl.L.Rev. 441, 443 (1983) ("An epidemiological study reveals the correlation between some factor and a significant excess in the number of deaths or injury above that which would otherwise have occurred — that is, above normal background levels.").
 See Dore, A Commentary on the Use of Epidemiological Evidence in Demonstrating Cause-in-Fact, 7 Harv.Envtl.L.Rev. 429, 436 (1983) ("Epidemiological evidence, like other generalized evidence, deals with categories of occurrences rather than particular individual occurrences. Epidemiological studies address questions such as `Does exposure to this chemical increase the incidence of cancer in a population?' but not `Did exposure to this chemical cause a particular person's cancer?'"); Note, Causation in Toxic Torts: Burdens of Proof, Standards of Persuasion, and Statistical Evidence, 96 Yale L.J. 376, 380 (1986).
 Dr. Done served as a Special Assistant to the Director for Pediatric Pharmacology of the FDA's Bureau of Drugs from 1971 to 1975. In this role, Done aided in the provision "of FDA input on research involving children and fetuses, and development of guidelines for pre-clinical safety evaluations of drugs for use in children and in pregnancy...." App. at 238. He also participated in publishing a paper called "General Guidelines for the Evaluation of Drugs to be Approved for Use during Pregnancy and for the Treatment of Infants and Children," in conjunction with the American Academy of Pediatrics in 1974. App. at 257.
 The conclusion Dr. Done draws from these studies is buttressed by inferences he draws from less probative, but nevertheless relevant, sources. These include analogies between the effect substances with a chemical structure similar to Bendectin have on human fetuses and the effect Bendectin may have, and inferences drawn from studies of the incidence of birth defects in the offspring of animals given Bendectin during pregnancy, in vivo studies, and the effect Bendectin has on animal fetuses outside an animal host, in vitro studies.
The animal studies Dr. Done usually relies upon to support his opinion were held in this case to be inadmissible in and of themselves, and also incapable of serving as a foundation for his testimony. The DeLucas have not challenged this ruling on appeal, however. While we, therefore, will not decide this issue, we note that an exhibit to an affidavit by Merrell Dow's medical expert indicates that it is supportive, though not a necessary prerequisite, of a conclusion that a substance is a human teratogen that it has been shown to exhibit "teratogenicity in experimental animals." App. at 206. Relevant literature suggests that experts would not ignore this type of evidence if it existed, though they would give far greater weight to the human epidemiological evidence. Black & Lilienfeld, Epidemiological Proof In Toxic Tort Litigation, 52 Fordham L.Rev. 732, 762 (1984); Hall & Silbergeld, supra, at 443-45.
Structure activity analysis is based on the hypothesis that drugs with similar chemical structures may be expected to act in similar ways. Done alleges that drugs containing structures like Bendectin's have been associated with a higher incidence of birth defects. He infers from this that Bendectin will also be associated with an increased incidence of birth defects, though he concedes that structure activity considerations may only properly serve as a basis for concluding that greater study of Bendectin's teratogenicity is needed. The district court did not hold that Done's structural activity analysis was inadmissible. It held only that this analysis was alone insufficient to carry the DeLucas' burden on causation. The DeLucas do not contend otherwise.
 There are other types of error that can affect the soundness of statistical studies. See D. Barnes & J.M. Conley, Statistical Evidence in Litigation, § 6.F-6.19 (1986 & 1989 Supp.). The parties have not focused on the particular types of error that can affect the validity of an epidemiological study, nor have they focused on any other indication of the possibility of error in a study other than that provided by the use of statistical significance testing and confidence intervals.
 These courts also rejected Dr. Done's attempt to use structure activity analysis and in vivo and in vitro studies to support his opinion that Bendectin is a teratogen: "Studies of this kind, singly or in combination, are not capable of proving causation in human beings in the face of the overwhelming body of contradictory epidemiological evidence." Richardson, 857 F.2d at 830; accord Lynch, 830 F.2d at 1194; see also In re Agent Orange Product Liability Litigation, 611 F.Supp. 1223, 1234 (E.D.N.Y.1985) (where a number of sound epidemiological studies had been conducted on the health effects of Agent Orange there was no other reasonable basis for expert testimony on the subject), aff'd, 818 F.2d 187 (2d Cir.1987).
 We note that the decision in Brock is considered by the Fifth Circuit to be an exception to that court's general refusal to require plaintiffs in toxic tort cases to present statistically significant epidemiological data to show causation. In a recent case involving a worker who was killed by cancer allegedly caused by his workplace exposure to nickel and cadmium, the Fifth Circuit held that an expert opinion as to causation unsupported by such evidence was improperly excluded as unreliable. In the course of so ruling, the Fifth Circuit expressly indicated that Brock was to be narrowly construed:
Legal standards of proof have not yet reached the point where a toxic tort plaintiff may establish causation only through statistically significant epidemiological studies, thereby rendering unreliable expert testimony based on anything except such studies.... [W]e will not extend Brock. We simply apply this court's traditional position that an expert's opinion need not be generally accepted before it can be sufficiently reliable and probative to be submitted to the jury and perhaps support a jury finding.
Christopherson v. Allied-Signal Corp., 902 F.2d 362, 367 (5th Cir.1990) (citations and brackets omitted).
 He is making a statement about the degree to which the relationship found in the data may be due to chance, but his decision to use a certain significance level as a check on the permissible inference to be drawn from the data is a methodological value judgment which is separate from the question of whether the data is of the type an expert would rely upon.
 While it is true that Downing can be read as applying only to so-called "novel" scientific evidence, see 753 F.2d at 1237, this is an unduly restricted reading of the opinion. The importance of Downing is in the framework it provides for analyzing claims that proffered scientific evidence is insufficiently trustworthy to be admissible. Where the "helpfulness" of expert testimony cannot be resolved via judicial notice, Downing sets forth the test for assessing the admissibility of the testimony.
 In this respect, Rules 702 and 703 intersect. If a study's method of data collection is faulty, it may well be that no expert would rely upon the data generated as a basis for drawing any inference about the studied subject.
 We, of course, do not suggest that where a proffered statistical analysis has no novel aspect, a Rule 702 hearing is necessary to evaluate the reliability of the study design and execution. Ordinarily, these matters may be pursued through cross-examination and countervailing expert testimony before the finder of fact.
 A sampling of the pertinent law review literature includes Black & Lilienfeld, supra; Black, A Unified Theory of Scientific Evidence, 56 Fordham L.Rev. 595 (1988); Cohen, Conceptualizing Proof and Calculating Probabilities: A Response to Professor Kaye, 73 Cornell L.Rev. 78 (1987); Cohen, Confidence in Probability: Burdens of Persuasion in a World of Imperfect Knowledge, 60 N.Y.U.L.Rev. 329 (1985); Dore, supra; Hall & Silbergeld, supra; Kaye, Is Proof of Statistical Significance Relevant?, supra; Kaye, Statistical Inference in Litigation, 46 Law & Contemp.Probs. 13 (1983); Kaye, Apples and Oranges: Confidence Coefficients and the Burden of Persuasion, 73 Cornell L.Rev. 54 (1987); Nesson, Agent Orange Meets the Blue Bus: Factfinding at the Frontier of Knowledge, 66 B.U.L.Rev. 521 (1986).
 The parties may choose to retain and present experts and/or the court may retain its own expert. F.R.Evid. 706. Counsel may be able to help through the preparation of so-called "Brandeis briefs."
 To the extent Dr. Done's testimony is found by the court to be admissible, that testimony will, of course, be subject to cross-examination at trial. When an expert is found to be qualified and a determination is made, or judicial notice taken, that his methodology is sufficiently reliable to make the opinion admissible, the expert's qualifications, his methodology, and his application of the methodology remain appropriate subjects for cross-examination. Even though admissible, the proffered opinion may be in error, inter alia, because of the risk of error inherent in the methodology, because the data or factual assumptions upon which the opinion rests are wrong, or because a mistake has been made in applying the methodology to the data or assumed facts. Thus, if Merrell Dow is required to go to trial, it may query Dr. Done about the "sampling error" inherent in his analysis, about the manner in which his data were gathered, and about any possible logical flaws in his reasoning. Payton v. Abbott Laboratories, 780 F.2d 147, 156 (1st Cir.1985) (weaknesses in the factual underpinnings of expert testimony that DES was a teratogen affected the weight and credibility of the testimony, not its admissibility). Thus, while a reliability assessment by the court (based on a hearing or prior judicial experience and judicial notice) plays a threshold role in determining the admissibility of expert opinion, the adversarial process and the finders of fact make the ultimate determination concerning the reliability of the opinions proffered. See, e.g., Barefoot v. Estelle, 463 U.S. 880, 899 & 901 n. 7, 103 S.Ct. 3383, 3397 & 3398 n. 7 (1983); Wilson v. Merrell-Dow Pharmaceuticals, 893 F.2d 1149, 1153-54 (10th Cir.1990); Ellis v. International Playtex, Inc., 745 F.2d 292, 304 (4th Cir.1984); 3 J. Weinstein and M. Berger, Weinstein's Evidence ¶ 703, at 703-32 (1988).
 Downing teaches that the frequency with which a scientific technique leads to erroneous results bears heavily on its reliability for evidential purposes. 753 F.2d at 1239. In addressing whether Done's testimony is reliable, the district court may thus usefully explore the effect of Done's use of several studies, rather than one study, on the accuracy of his conclusions. This would involve consideration of Done's contention that more reliable inferences may be drawn from a number of studies that showed positive relationships between Bendectin and birth defects, than from one study that did meet the .05 level of significance.
 We have previously suggested that what is "helpful" for purposes of Rule 702 may be different in a criminal context than in a civil one. Downing, 753 F.2d at 1241 & n. 22. We have no occasion here to further pursue that suggestion.
 We reiterate that Merrell Dow did not contend that Done's testimony, if admissible, was insufficient to raise a genuine issue of material fact as to the issue of causation.
 This is a diversity case and the district court will be required to apply the burden of proof standard applicable under New Jersey law. See, e.g., Blair v. Manhattan Life Ins. Co., 692 F.2d 296, 299 (3d Cir.1982). The parties have not devoted their attention to the appropriate burden of proof under New Jersey law. The traditional New Jersey standard requires that a plaintiff show to a reasonable degree of medical probability that the defendant's conduct caused her injuries. See Johnesee v. Stop & Shop Cos., Inc., 174 N.J.Super. 426, 416 A.2d 956, 959 (App.Div.1980); cf. Thompson v. Merrell Dow Pharmaceuticals, 229 N.J.Super. 230, 551 A.2d 177 (App.Div.1988) (plaintiffs were required to prove that Bendectin caused their child's birth defects even though their claim was based on strict liability). New Jersey courts have typically held that if the plaintiff's injury would have occurred absent defendant's conduct, or in this case, exposure to defendant's product, then there is no cause-in-fact. Ostrowski v. Azzara, 111 N.J. 429, 545 A.2d 148, 153 (1988).
 See also Marder v. G.D. Searle & Co., 630 F.Supp. 1087, 1092 (D.Md.1986) ("In epidemiological terms, a two-fold increased risk is an important showing for plaintiffs to make because it is the equivalent of the required legal burden of proof — a showing of causation by the preponderance of the evidence or in other words, a probability of greater than 50%."), aff'd, 814 F.2d 655 (4th Cir.1987); Cook v. United States, 545 F.Supp. 306, 308 (N.D.Cal.1982) ("Whenever the relative risk to vaccinated person is greater than 2 times the risk to unvaccinated persons, there is a greater than 50% chance that a given GBS case among vaccinees of that latency period is attributable to vaccination, thus sustaining plaintiff's burden of proof on causation."); Black & Lilienfeld, supra, at 769 ("In no case ... can evidence suffice to establish a causal link if it does not include at least reasonable estimates of exposure levels and durations, and data that reasonably indicate a relative risk greater than 2.").
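The arithmetic behind the "relative risk greater than 2" threshold quoted in these authorities can be illustrated with a short sketch. It is not part of the opinion or the record. It assumes the standard epidemiological identity that the share of cases among exposed persons attributable to the exposure equals (RR - 1) / RR, and it further assumes an unbiased relative risk estimate. The function name `attributable_fraction` is hypothetical and chosen only for illustration.

```python
def attributable_fraction(relative_risk: float) -> float:
    """Share of cases among the exposed attributable to exposure.

    Illustrative assumption: uses the identity (RR - 1) / RR, where RR is
    the relative risk of the exposed group versus the unexposed group.
    """
    if relative_risk <= 0:
        raise ValueError("relative risk must be positive")
    return (relative_risk - 1.0) / relative_risk


if __name__ == "__main__":
    # Hypothetical relative risks, not figures from any Bendectin study.
    for rr in (1.5, 2.0, 3.0):
        share = attributable_fraction(rr)
        print(f"RR = {rr}: attributable share = {share:.3f}")
```

On these assumptions, a relative risk of exactly 2 corresponds to an attributable share of one half, so only a relative risk above 2 places the share above the 50% mark to which the quoted courts equate the preponderance standard.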
 Even if Dr. Done's statistical analysis is found to be admissible, its lack of statistical significance at the .05 level may appropriately play some role in deciding this subsequent issue. The relationship between confidence levels and the more likely than not standard of proof is a very complex one, however, and in the absence of more education than can be found in this record, we decline to comment further on it. For a discussion of these issues, see Cohen, Confidence in Probability: Burdens of Persuasion in a World of Imperfect Knowledge, supra; Cohen, Conceptualizing Proof and Calculating Probabilities: A Response to Professor Kaye, supra; Kaye, Is Proof of Statistical Significance Relevant?, supra; Kaye, Statistical Inference in Litigation, 46 Law & Contemp.Probs. 13 (1983); Kaye, Apples and Oranges: Confidence Coefficients and the Burden of Persuasion, 73 Cornell L.Rev. 54 (1987); cf. Addington v. Texas, 441 U.S. 418, 423, 99 S.Ct. 1804, 1807, 60 L.Ed.2d 323 (1979) (the preponderance of the evidence standard reflects a judgment that the plaintiff and the defendant should "share the risk of error in roughly equal fashion"); Santosky v. Kramer, 455 U.S. 745, 755, 102 S.Ct. 1388, 1395, 71 L.Ed.2d 599 (1982) (same); In re Winship, 397 U.S. 358, 370-72, 90 S.Ct. 1068, 1075-76, 25 L.Ed.2d 368 (1970) (Harlan, J., concurring) (same).