Analytics dashboard - wordcloud generation (w/ alternatives)
Current implementation status
A basic wordcloud is implemented on the branch dashboard-popular-searches.
Wordcloud image generation is implemented using magic_cloud.
Keyword extraction is implemented by counting how often each word appears across the search queries, then taking the 10 most frequent words. Note: alternative approaches to keyword extraction are considered in the section below.
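The frequency-based extraction described above can be sketched as follows. This is a minimal Python illustration, not the Ruby code on the branch; the function name top_keywords, the tokenising regex, the optional stopwords parameter, and the sample queries are all assumptions made for the example.

```python
import re
from collections import Counter


def top_keywords(queries, limit=10, stopwords=frozenset()):
    """Count word frequencies across search queries and return the most frequent ones.

    `stopwords` is an optional, hypothetical set of words to ignore.
    """
    counts = Counter()
    for query in queries:
        # Lowercase the query and pull out runs of word characters.
        words = re.findall(r"\w+", query.lower())
        counts.update(word for word in words if word not in stopwords)
    # most_common returns (word, count) pairs sorted by descending count.
    return counts.most_common(limit)


# Made-up sample queries, for illustration only.
sample_queries = [
    "bike lanes in the city centre",
    "city centre parking",
    "new bike parking",
]
print(top_keywords(sample_queries, limit=3, stopwords={"in", "the"}))
```

The resulting (word, count) pairs could then serve as the word weights handed to the wordcloud generator.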
Detailed description
The Moderator / Administrator analytics dashboard contains a visualisation of "What is being searched for? (clustering?)" (second point from issue #11). This visualisation is created by extracting keywords from the searches made in a given time period (e.g. the past week or the past 10 days). These keywords are shown as a wordcloud, like the example here.
There are several approaches / alternatives to the keyword extraction:
- Basic Version - the current implementation; could be improved with regex and custom stopwords.
- KeyBERT - was tested with longer and shorter search queries generated by ChatGPT and extracted keywords fairly accurately (Jupyter Notebook here). However, the shorter queries were generated by ChatGPT from the longer ones (which were in turn generated from the platform's template content), so this test may partly validate ChatGPT's ability to extract keywords from questions rather than KeyBERT's.
  - the implementation is in Python, so it will need either a library (e.g. pycall) to call Python functions from Ruby, or a separate Python environment.
  - use KeyBERT(model='paraphrase-multilingual-MiniLM-L12-v2') for multilingual support (including Dutch).
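If the KeyBERT route is taken, the Python side might look roughly like the sketch below. It assumes the keybert package is installed and uses its extract_keywords method with the multilingual model named above. The helper names (extract_query_keywords, merge_keyword_scores), the value of top_n, and the idea of summing scores to get wordcloud weights are assumptions for illustration, not something taken from the notebook.

```python
from collections import defaultdict


def extract_query_keywords(queries, top_n=3):
    """Run KeyBERT on each query; returns one list of (keyword, score) pairs per query."""
    # Imported here so the rest of the module can be used without keybert installed.
    from keybert import KeyBERT

    model = KeyBERT(model="paraphrase-multilingual-MiniLM-L12-v2")
    # stop_words=None disables the default English stop-word list,
    # which would not suit Dutch queries.
    return [
        model.extract_keywords(
            query, keyphrase_ngram_range=(1, 1), stop_words=None, top_n=top_n
        )
        for query in queries
    ]


def merge_keyword_scores(per_query_keywords, limit=10):
    """Sum keyword scores across queries and return the `limit` highest-weighted keywords."""
    weights = defaultdict(float)
    for keywords in per_query_keywords:
        for keyword, score in keywords:
            weights[keyword] += score
    # Sort by accumulated weight, highest first.
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)[:limit]
```

Calling merge_keyword_scores(extract_query_keywords(queries)) would give a weighted keyword list comparable in shape to the basic version's output, so the wordcloud rendering would not need to change. Whether summed similarity scores are a good weighting is an open question; counting how many queries yield each keyword is an equally plausible alternative.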
Conclusion for now (May 2023): basic version is enough until the platform is developed further.