Fabiana Clemente is the Co-founder and Chief Data Officer at YData. YData is an AI startup that created the first data-centric development solution to combine data discovery, improvement, and scaling in a single platform.
What initially attracted you to AI and machine learning?
My background is in Applied Mathematics, where I had the opportunity to learn and understand how we can extract information out of data, as well as how to do it through code. At the time it was not as sexy as Machine Learning is today, but it was definitely what sparked my passion for the area.
Could you share the genesis story behind YData?
As a Data Scientist who has worked for both startups and enterprises, I had my fair share of struggles – sometimes access to data was blocked under the premise of security or privacy, other times access was easy but the quality of the data was not even close to what was needed to build AI-based solutions. Knowing that these struggles are very common in most organizations inspired us to start the company, with the goal of helping these teams overcome those obstacles by accelerating their AI development with improved data.
Could you describe for our audience what synthetic data is?
Synthetic data is any data that was not generated in the real world – in other words, any data that is created artificially. There is a range of methods that enable the generation of synthetic data, from rule-based strategies all the way to Machine and Deep Learning models that learn those “rules” for us. At YData, we adopted and specialized in a Deep Learning-based strategy to generate new data that preserves the behaviour of real-world events without the privacy concerns.
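To make the rule-based end of that spectrum concrete, below is a minimal, purely illustrative sketch – not anything from YData's platform – in which synthetic records are produced from hand-written rules and distributions instead of being learned from real data. The column names and value ranges are assumptions chosen only for the example.

```python
import random

def generate_rule_based_transaction():
    """Create one synthetic transaction from hand-written rules.

    Every value below comes from rules we wrote ourselves (ranges,
    categories, simple if/else logic), not from any real dataset.
    """
    amount = round(random.uniform(5.0, 500.0), 2)          # assumed price range
    channel = random.choice(["online", "in-store", "mobile"])
    # A hand-written "rule": online purchases are more often international.
    is_international = random.random() < (0.30 if channel == "online" else 0.05)
    return {"amount": amount, "channel": channel, "international": is_international}

# Generate a small synthetic dataset
synthetic_rows = [generate_rule_based_transaction() for _ in range(5)]
for row in synthetic_rows:
    print(row)
```

The limitation of this approach is that someone has to know and encode the rules up front; model-based synthesis, discussed below, learns them from the data instead.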
What makes synthetic data so important?
The more organizations realize the importance of data to boost their businesses, the more the importance and role of synthetic data will be understood. Collecting real data is not only time-consuming and expensive but also, sometimes, impossible. To build AI applications, data is a hard requirement – this is where synthetic data comes to the rescue. The ability to generate unseen scenarios, or simply to unlock access to data, is key to evolving in a world where pioneers like Andrew Ng state that becoming data-centric is essential for successful AI adoption.
In self-driving cars and other machine automation use cases we can already see the importance of synthetic data, so I would say it's only natural for this understanding to spread across all industry verticals.
How does YData generate synthetic data?
YData mainly leverages Deep Generative models to learn the statistical attributes of, and the correlations between, the variables in the original data. This allows the model to generate a statistically relevant dataset that has the same business value as the original one, without allowing traceability back to the original records.
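As an illustration of that idea – learning a dataset's statistical properties and then sampling fresh records from the fitted model – here is a minimal sketch using a simple generative model (a Gaussian mixture from scikit-learn) on made-up data. This is not YData's implementation, which relies on deep generative models, but the fit-then-sample principle is the same.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy "original" data: two correlated numeric columns (age, income).
# In practice this would be a real, private dataset.
rng = np.random.default_rng(42)
age = rng.normal(40, 10, size=1000)
income = 800 * age + rng.normal(0, 5000, size=1000)   # income correlated with age
real_data = np.column_stack([age, income])

# Fit a generative model that captures the joint distribution
# (means, variances, and the correlation between the columns).
model = GaussianMixture(n_components=5, random_state=0).fit(real_data)

# Sample brand-new records: statistically similar to the originals,
# but none of them is a copy of a real row, so there is no
# one-to-one traceability back to the source records.
synthetic_data, _ = model.sample(1000)

print("real correlation:     ", np.corrcoef(real_data.T)[0, 1].round(3))
print("synthetic correlation:", np.corrcoef(synthetic_data.T)[0, 1].round(3))
```

The printed correlations should be close, showing that the sampled data reproduces the statistical relationship between the columns rather than the individual records.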
YData is pushing this technology forward and is the company behind the Synthetic Data Community – a group of data science experts committed to evangelizing the technology and helping anyone who wants to learn about it and use it.
How does the YData platform help to discover and unlock new data sources?
YData’s platform includes built-in connectors to any type of database, data warehouse or data lake, which allow users to easily access relevant metadata and understand whether the existing data is useful for answering the business question at hand – without even looking at the real records.
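To give a sense of what metadata-first discovery can look like in practice, here is a small illustrative sketch – not YData's actual connectors – that uses SQLAlchemy to read a database's structure (table names, columns, and types) without pulling any records. The connection string is a placeholder.

```python
from sqlalchemy import create_engine, inspect

# Hypothetical connection string – replace with your own database.
engine = create_engine("postgresql://user:password@analytics-db:5432/sales")

inspector = inspect(engine)

# Pull structural metadata only: table names, columns, and types.
# No actual records are read at this point.
for table_name in inspector.get_table_names():
    print(f"Table: {table_name}")
    for column in inspector.get_columns(table_name):
        print(f"  {column['name']}: {column['type']}")
```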
Could you share some details regarding the Synthetic Data Open Source community?
Synthetic data is still in its early days, and for that reason awareness of how it is generated, its benefits, and its limitations remains limited among a wider audience. That is why, at YData, we decided to take a more educational route by creating the Synthetic Data community – besides being a place to exchange ideas and get help from experts in the synthetic data field, it is also a place where data scientists and other technical profiles can start their journey into synthetic data with some of the most interesting algorithms from the literature.
Furthermore, we also offer a perspective on data quality, so data scientists can first understand the data they’re working with before synthesizing it or improving the synthesization process. We are truly committed to helping data teams become more and more data-centric.
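As a flavour of the kind of checks that matter before synthesizing data, here is a minimal, illustrative profiling sketch in pandas – row counts, missing values, duplicates, and correlations between numeric columns. A real workflow would typically use a dedicated profiling tool (YData maintains the open-source ydata-profiling package, for instance) rather than hand-rolled prints; the dataset below is made up for the example.

```python
import pandas as pd

def quick_profile(df: pd.DataFrame) -> None:
    """Print a few basic data-quality indicators before any synthesis."""
    print("rows, columns:", df.shape)
    print("\nmissing values per column:")
    print(df.isna().sum())
    print("\nduplicate rows:", df.duplicated().sum())
    print("\ncorrelations between numeric columns:")
    print(df.select_dtypes("number").corr().round(2))

# Example with a tiny, made-up dataset
df = pd.DataFrame({
    "age": [34, 45, None, 29, 45],
    "income": [52000, 61000, 48000, None, 61000],
    "segment": ["A", "B", "A", "C", "B"],
})
quick_profile(df)
```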
YData recently announced $2.7m in funding to fast-track its international expansion. Can you share some details regarding what this means for the future of the company and its expansion strategy?
YData was born international – we knew that this kind of technology needs early adopters, who are usually found in the most sophisticated markets. For that reason, our first customers were already outside Portugal, across Europe, and we’re now establishing a presence in North America as well. This funding will allow us to strengthen our presence on both continents, not only commercially but also by growing the team: we’re a fully distributed team, which allows us to hire the best talent, wherever they are.
Is there anything else that you would like to share about YData?
YData is pushing the boundaries of data-centric AI and creating a new category: DataPrepOps – although it’s an ugly name, it addresses a pain most companies face nowadays when it comes to data science development. The Data Quality trend continues to grow, and after Data Pipelines and Data Observability, Data Quality for Data Science teams is still in its infancy, and YData is emerging as a thought leader in data preparation.
Thank you for the great interview. Readers who wish to learn more should visit YData.