Tech Threads

How Google News Groups Similar News Together

Google News uses clustering, a machine learning technique, to group similar news stories and articles together. Interestingly, it does not rely on thousands of news editors; instead it uses clustering to form groups of similar data based on their common characteristics. Mahout is machine learning software from the Apache community that applications can leverage to analyse large sets of data. Before Mahout, analysing large sets of data was considerably more complex. Mahout makes extensive use of Apache Hadoop to gain the power of parallel and distributed computing. Mahout offers three machine learning techniques:

  • Recommendation
  • Classification
  • Clustering

Clustering does not sort data into an existing set of known categories; the groups emerge from the data itself. This is particularly useful when we are not sure how to organize the data in the first place. Google News uses this powerful technique to make sense of the ever-changing stream of news and articles from around the world, enabling us to keep up to date with the latest events around the globe.
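To make the idea concrete, here is a minimal sketch of clustering in Python. It is not Google News's actual method and it does not use Mahout's API; it only assumes that two headlines are "similar" when their word counts have a high cosine similarity. The function names, the sample headlines, and the 0.3 threshold are all hypothetical choices for illustration.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn a headline into a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster_headlines(headlines, threshold=0.3):
    """Greedy single-pass clustering; no categories are known in advance."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [str]}
    for headline in headlines:
        vec = vectorize(headline)
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            # Nothing similar enough yet, so this headline starts a new group.
            clusters.append({"centroid": Counter(vec), "members": [headline]})
        else:
            # Fold the headline's words into the group's running word counts.
            best["centroid"].update(vec)
            best["members"].append(headline)
    return [cluster["members"] for cluster in clusters]


# Hypothetical sample data
headlines = [
    "Central bank raises interest rates again",
    "Interest rates raised by central bank",
    "Local team wins football championship",
    "Football championship won by local team",
]
groups = cluster_headlines(headlines)
for group in groups:
    print(group)
```

With this sample data the two banking headlines end up in one group and the two football headlines in another, even though no category labels were ever supplied. A production system would likely use stronger text features and a distributed algorithm, which is the kind of workload Mahout on Hadoop targets.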

The recommendation technique combines information about the user with information about the wider community to predict which products or services we are likely to prefer, mainly when we are browsing e-commerce sites.
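The combination of user information and community information can be sketched as a small user-based collaborative filter in Python. This is an illustration under stated assumptions, not Mahout's recommender API: the `recommend` function, the Jaccard overlap used as the similarity measure, and the purchase data are all hypothetical.

```python
from collections import defaultdict


def recommend(target, purchases, top_n=3):
    """Suggest items bought by users whose purchases overlap with the target's."""
    own = purchases[target]
    scores = defaultdict(float)
    for user, items in purchases.items():
        if user == target:
            continue
        union = own | items
        # Jaccard similarity: shared items divided by all items either user bought.
        similarity = len(own & items) / len(union) if union else 0.0
        if similarity == 0:
            continue  # users with nothing in common contribute nothing
        for item in items - own:
            # Items the target lacks are weighted by how similar the other user is.
            scores[item] += similarity
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical purchase history: user -> set of items bought
purchases = {
    "alice": {"laptop", "mouse"},
    "bob": {"laptop", "mouse", "keyboard"},
    "carol": {"laptop", "monitor"},
    "dave": {"novel"},
}
suggestions = recommend("alice", purchases)
print(suggestions)  # ['keyboard', 'monitor']
```

Here the user information is alice's own purchases, and the community information is what everyone else bought. Bob overlaps with alice more than carol does, so his keyboard ranks above carol's monitor, while dave shares nothing with alice and is ignored.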

