BACKGROUND AND PURPOSE: The authors conducted this study to evaluate the incompleteness of follow-up and the validity of the diagnostic codes in the medical insurance databases used in a cohort study. They also suggest several regression models that are useful for analyzing such incomplete data. METHODS: The subjects of the Seoul Cohort (n=14,533) were followed up for three and a half years. Based on chart reviews of the subjects who had the diagnostic code of gastric cancer in the medical insurance databases, and using cancer registry databases and death certificates as secondary sources, forty-four cases of gastric cancer were identified. Regression coefficients and the associated p-values were estimated using the following six methods, and the results were compared with each other. Method 1: The subjects with the diagnostic code in the medical insurance databases were considered cases of gastric cancer. Method 2: The confirmed cases were considered cases of gastric cancer. Method 3: The cases were the subjects with the diagnostic code whose diagnosis was confirmed by medical chart review. Method 4: Ordinal logistic regression. Method 5: Weighted logistic regression. Method 6: Polytomous logistic regression. RESULTS: A total of 12,541 subjects were followed up, excluding censored cases. One hundred and nine subjects had a diagnosis of gastric cancer in the medical utilization databases: 43 were probable cases whose diagnosis was not confirmed by chart review, 26 were ruled out, and 26 were confirmed cases. Another 14 cases were confirmed using the cancer registry and death certificates. Using the secondary sources, four additional cases were confirmed, for a total of 44 cases confirmed during follow-up. In method 1, a past history of gastritis and gastric ulcer was a significant risk factor, whereas the intake frequency of fresh vegetables, ice cream, and coffee was associated with significantly decreased risk.
In methods 2 and 6, green tea was a significant protective factor, whereas in methods 3-5 no significant variables were found. CONCLUSIONS: Polytomous logistic regression was the preferred method for a cohort study that uses secondary sources of information for follow-up; it provided additional information for risk factor identification, especially regarding the specificity of the risk factors.
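
ILLUSTRATION: The case definitions in methods 1-3 differ only in which follow-up statuses count as gastric cancer, so the resulting case counts can be checked with a short tally. The following is a minimal sketch, not the authors' code. The status labels and function names are hypothetical. It assumes that the 14 cases confirmed through the cancer registry and death certificates belong to the 109 subjects with a diagnostic code (43 + 26 + 26 + 14 = 109) and that the four additional cases had no diagnostic code. The three-level outcome shown for a polytomous model is likewise an assumed coding, since the abstract does not state the categories used.

```python
# Follow-up status counts taken from the RESULTS text.
# ASSUMPTION: the 14 registry/death-certificate confirmations are part of the
# 109 coded subjects, and the 4 additional cases had no diagnostic code.
STATUS_COUNTS = {
    "coded_probable": 43,              # code present, not confirmed by chart review
    "coded_ruled_out": 26,             # code present, gastric cancer ruled out
    "coded_confirmed_chart": 26,       # code present, confirmed by chart review
    "coded_confirmed_secondary": 14,   # code present, confirmed via registry or death certificate
    "uncoded_confirmed_secondary": 4,  # no code, found only through secondary sources
}


def outcome_method1(status: str) -> int:
    """Method 1: any subject with the diagnostic code is a case."""
    return int(status.startswith("coded_"))


def outcome_method2(status: str) -> int:
    """Method 2: any confirmed case is a case, whatever the source."""
    return int("confirmed" in status)


def outcome_method3(status: str) -> int:
    """Method 3: only coded subjects confirmed by chart review are cases."""
    return int(status == "coded_confirmed_chart")


def outcome_polytomous(status: str) -> str:
    """Hypothetical three-level outcome for a polytomous model (assumed coding)."""
    if "confirmed" in status:
        return "confirmed"
    if status == "coded_probable":
        return "probable"
    return "non_case"


def count_cases(outcome) -> int:
    """Number of subjects that a binary case definition labels as cases."""
    return sum(n for status, n in STATUS_COUNTS.items() if outcome(status))


def confirmed_share_of_coded() -> float:
    """Share of coded subjects whose gastric cancer was confirmed by any source."""
    coded = count_cases(outcome_method1)
    confirmed = sum(
        n for status, n in STATUS_COUNTS.items()
        if status.startswith("coded_") and "confirmed" in status
    )
    return confirmed / coded


if __name__ == "__main__":
    for name, fn in [("method 1", outcome_method1),
                     ("method 2", outcome_method2),
                     ("method 3", outcome_method3)]:
        print(name, count_cases(fn))
    print("confirmed share of coded subjects:", round(confirmed_share_of_coded(), 3))
```

Under these assumptions the tally gives 109 cases for method 1, 44 for method 2, and 26 for method 3, which matches the counts reported in the RESULTS. Subjects without any of these statuses would be non-cases in every definition and are omitted from the tally.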