Data Science in Spain: knowledge and public perception of big data and artificial intelligence
PHASE 1. Public understanding of Data Science.
In this phase we use surveys to gauge society's understanding of and attitudes toward Data Science (DS). Knowledge of and attitudes toward many subjects, including science, have often been studied with this quantitative method (Bauer, 2008); however, to the best of our knowledge, no studies analyse and compare knowledge of and attitudes toward Data Science in Spain. DataScienceSpain will therefore measure both at two points in time, which will also make it possible to track how they change over the six months between the two surveys. The first wave of the survey will take place in the third month of the project and the second will be distributed in month 9; the results of both waves will then be compared. In both waves a quota system will be implemented to ensure an adequate and representative distribution across dimensions such as gender, age and region.
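As an illustration of how a quota system can be set up, the sketch below splits a target sample across strata in proportion to their population size. It is a minimal example under stated assumptions, not the project's procedure: the function name `allocate_quotas`, the age groups, the population counts and the sample size of 1,200 are all hypothetical, and largest-remainder rounding is just one reasonable way to make the quotas add up exactly.

```python
def allocate_quotas(population, sample_size):
    """Split sample_size across strata in proportion to their population counts.

    Uses largest-remainder rounding so the quotas sum exactly to sample_size.
    `population` maps a stratum label to an integer count.
    """
    total = sum(population.values())
    quotas, remainders = {}, {}
    for stratum, count in population.items():
        # Integer arithmetic avoids floating-point rounding surprises.
        quotas[stratum], remainders[stratum] = divmod(sample_size * count, total)
    # Interviews still unassigned after rounding every quota down.
    leftover = sample_size - sum(quotas.values())
    # Give one extra interview to the strata with the largest remainders.
    for stratum in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        quotas[stratum] += 1
    return quotas


# Hypothetical population counts (in thousands) by age group; not real figures.
age_population = {"18-34": 9000, "35-54": 14500, "55+": 15500}
age_quotas = allocate_quotas(age_population, 1200)

for group, quota in age_quotas.items():
    print(f"{group}: {quota} respondents")
```

The same function could be applied separately to gender and region, or to their cross-classification if interlocking quotas were wanted. Reusing identical quotas in both waves keeps the two samples comparable.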
PHASE 2. Knowledge of journalists about Data Science.
This phase adopts a more qualitative approach: in-depth and reconstruction interviews will be used to understand the challenges that journalists face when reporting on Data Science, Big Data and Artificial Intelligence. Ten science journalists will be interviewed and, additionally, two of those who have previously reported on DS will be chosen for a reconstruction interview. This kind of interview allows the interviewer to observe in detail the steps the interviewee (here, a journalist) follows in a given process (here, the production of a journalistic piece on Data Science, Big Data or Artificial Intelligence). Together, these interviews will reveal the challenges and shortcomings that journalists face when dealing with DS information, both in their daily work and in relation to their audiences. Gender balance is sought: at least 40% of the interviewed journalists will be women.
PHASE 3. Data Science in the media.
Besides the knowledge of journalists and citizens, discovering how Data Science is depicted in the media requires studying media content itself. The large volume of content published or distributed across different media demands computational methods to collect and analyse it. We will therefore select a sample of online media sites (including digital-native media and the websites of traditional press, radio stations and television broadcasters), together with a list of keywords, and create a set of scripts (scraping, connection to APIs, etc.) that automatically collect all content related to Data Science. This content, collected at the same time as the first survey and the interviews, will be analysed automatically using natural language processing and machine learning, studying and comparing formal characteristics, topics (using topic modelling) and sentiment (based on a dictionary). Once a training corpus has been built from the collected sample, we will use machine learning techniques to measure the comprehensibility of the content, a key feature if society is to understand new and complex aspects of DS. To that end, different algorithms will be used (Naïve Bayes, logistic regression, SVM, kNN, decision trees, random forest or neural networks) to generate and evaluate models, using an initial corpus of examples classified by people (using an ad-hoc comprehensibility scale that is peer-validated and checked for inter-coder reliability) and standard evaluation measures (accuracy, recall, AUC, etc.).
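Two of the steps described above, keyword-based selection of content and dictionary-based sentiment analysis, can be sketched in a few lines of Python. This is a toy illustration under stated assumptions rather than the project's pipeline: the keyword list, the six-word lexicon, the sample texts and the function names are all hypothetical, and a real analysis of Spanish media would rely on a validated Spanish-language sentiment dictionary.

```python
import re

# Hypothetical keyword list and toy sentiment lexicon (assumptions for illustration).
KEYWORDS = ["data science", "big data", "artificial intelligence"]
LEXICON = {
    "breakthrough": 1, "opportunity": 1, "useful": 1,
    "risk": -1, "threat": -1, "bias": -1,
}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens (Unicode-aware)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def mentions_keywords(text, keywords=KEYWORDS):
    """Return True if any keyword appears in the text (simple substring match)."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


def sentiment_score(text, lexicon=LEXICON):
    """Sum the lexicon values of all tokens and divide by the token count.

    Words missing from the lexicon count as neutral (0); empty text scores 0.0.
    """
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(lexicon.get(token, 0) for token in tokens) / len(tokens)


# Made-up example texts standing in for collected articles.
articles = [
    "Artificial intelligence is a useful opportunity for hospitals.",
    "Experts warn that big data brings risk and bias.",
    "The local team won the match on Sunday.",
]

# Keep only the texts that mention a keyword, then score each one.
relevant = [article for article in articles if mentions_keywords(article)]
scores = [sentiment_score(article) for article in relevant]

for article, score in zip(relevant, scores):
    print(f"{score:+.2f}  {article}")
```

Dividing by the number of tokens keeps scores comparable between short and long pieces. Substring matching is deliberately simple; matching on word boundaries or lemmas would reduce false positives on a real corpus.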
Bauer, M. W. (2008). Survey research and the public understanding of science. In Handbook of public communication of science and technology (pp. 125–144). Routledge.
Project funded by the Spanish Foundation for Science and Technology (FECYT) in the Call for grants to promote the scientific, technological and innovation culture 2019-2020. [FCT-18-13437]