--* INPUT_persons_datapartners_concepts: --* attr: --* fillcolor: '3' --* desc: Defines the input data for training the model; identifying a hold_count column --* (for entries to be ignored during training). Results in training and validation --* but no test data. --* ext: sql --* inputs: --* - person_all_facts --* - condition_era_full_clean --* -- first up: let's gather all of the different condition_concept_ids for each person into a single Array column... even though the function for it is called collect_list() -- notice the column type Array in the output preview -- we also define a subset to be labeled as hold-out; -- non-hold-out rows are used for model training -- hold-out rows are used for evaluation with coherence, -- both types are used for data visualizations (except coherence) SELECT /*+ COALESCE(14) */ condition_era_full_clean.person_id as person_id, condition_era_full_clean.data_partner_id as data_partner_id, string(condition_era_full_clean.condition_concept_id) as concept_id, Person_all_facts.set = 'validate' as hold_out FROM condition_era_full_clean INNER JOIN Person_all_facts ON Person_all_facts.person_id = condition_era_full_clean.person_id -- note: even though this is going to the model building, downstream code uses the -- hold_out column defined above to split into train/validate sets WHERE Person_all_facts.set = 'train' OR Person_all_facts.set = 'validate' -- GROUP BY person_id, data_partner_id