#* topic_description_raw:
#*   attr:
#*     fillcolor: '3'
#*   desc: Extract descriptions of topics from the model as probability distributions
#*     over all terms.
#*   ext: py
#*   inputs:
#*   - lda_model
#*
# now we can get the topics!
# we mentioned earlier that the vectorizer has .fit() and .transform() methods, and it turns out that the LDA model
# also has .fit() (which we've seen) and .transform() (which we'll use soon). Because these 'fit a model' / 'use a
# model' steps are common across different parts of a machine learning analysis, it makes sense to "chain" them
# together into a pipeline. That's what we produced with Model(Stage(ldamodel)): a pipeline of models contained in
# stages; our pipeline just happens to have only one stage. Having pipelines of models in stages supports developing
# complex pipelines and reusing them, or using them in other apps in the enclave (like the "Modeling objective" app).
# We're not really using those features here, though; we just needed to build the pipeline so we could pass our
# pyspark.ml model between transforms :) (there's a toy sketch of the wrap/unwrap round trip at the bottom of this file)
# here we just get the model back out of the pipeline and call its .describeTopics() method, which returns a DataFrame.
def topic_description_raw(lda_model):
    ldamodel = lda_model.stages[0].model
    # maxTermsPerTopic defaults to 10; we can ask for up to the size of the full vocabulary
    topics_description = ldamodel.describeTopics(maxTermsPerTopic=100000000)
    # notice the output types - arrays!
    return topics_description


#################################################
## Global imports and functions included below ##
#################################################

import pickle


def to_pickle(data):
    output = Transforms.get_output()
    output_fs = output.filesystem()
    with output_fs.open('data.pickle', 'wb') as f:
        pickle.dump(data, f)


def from_pickle(transform_input):
    with transform_input.filesystem().open('data.pickle', 'rb') as f:
        data = pickle.load(f)
    return data
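
# ----------------------------------------------------------------------------
# The termIndices/termWeights columns returned by describeTopics() are arrays,
# so a natural next step is to unpack them into one row per (topic, term)
# pair. A hedged sketch of that step follows; `tidy_topics` is a hypothetical
# helper, not part of the original pipeline, and it assumes Spark 3.x (where
# arrays_zip() names the struct fields after its input columns).
# ----------------------------------------------------------------------------
def tidy_topics(topics_description):
    from pyspark.sql import functions as F
    # zip the parallel arrays, then explode with position so we keep each
    # term's rank within its topic
    zipped = F.arrays_zip('termIndices', 'termWeights')
    return (topics_description
            .select('topic', F.posexplode(zipped).alias('rank', 'term'))
            .select('topic', 'rank',
                    F.col('term.termIndices').alias('termIndex'),
                    F.col('term.termWeights').alias('termWeight')))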
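
# ----------------------------------------------------------------------------
# Illustrative sketch, not part of the transform above: foundry_ml's
# Model(Stage(...)) wrapper isn't available outside the enclave, so this toy
# demo uses plain pyspark.ml's PipelineModel to show the same idea - fit a
# model, wrap it in a one-stage pipeline, pull it back out, and inspect
# describeTopics(). All names below (spark, toy_df, etc.) are made up for the
# demo, which is guarded so it never runs when this file is imported as a
# transform. Requires a local pyspark installation.
# ----------------------------------------------------------------------------
if __name__ == '__main__':
    from pyspark.ml import PipelineModel
    from pyspark.ml.clustering import LDA
    from pyspark.ml.feature import CountVectorizer
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master('local[1]').getOrCreate()
    toy_df = spark.createDataFrame(
        [(0, ['apple', 'banana', 'apple', 'pie']),
         (1, ['carrot', 'banana', 'carrot', 'soup'])],
        ['id', 'tokens'])

    # the vectorizer: .fit() learns a vocabulary, .transform() applies it
    vectorized = (CountVectorizer(inputCol='tokens', outputCol='features')
                  .fit(toy_df)
                  .transform(toy_df))

    # the LDA estimator has the same .fit()/.transform() shape
    fitted_lda = LDA(k=2, maxIter=5, featuresCol='features').fit(vectorized)

    # wrap the fitted model in a one-stage pipeline, then pull it back out;
    # in the enclave the analogous wrap is Model(Stage(fitted_lda)) and the
    # unwrap is lda_model.stages[0].model
    pipeline = PipelineModel(stages=[fitted_lda])
    recovered = pipeline.stages[0]

    # describeTopics() returns a DataFrame whose termIndices/termWeights
    # columns are arrays
    recovered.describeTopics(maxTermsPerTopic=3).show(truncate=False)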